Research has highlighted a number of issues with oral testing which examination boards and teachers need to heed, as they have important implications not just for the way GCSE syllabi are designed, but also for the conduct and assessment of oral GCSE exams and for our teaching. The issues I will highlight are, I suspect, generalizable to other types of oral tests conducted in other educational systems, especially when it comes to the reliability of the assessment procedures and the authenticity of the tasks adopted. They are important issues, as they call into question the fairness and objectivity of the tests, as well as whether we are truly preparing children for real, native-like L2 communication.
Issue n.1 – Instant or delayed assessment?
A study by Hurman (1996), cited in Macaro (2007), investigated the extent to which the timing of examiners' assessment of the content and accuracy of candidates' responses to GCSE role-plays affected the marks awarded. Hurman took 60 experienced examiners and divided them into two groups: one waited a short time before awarding the mark, the other awarded it instantaneously. Hurman's findings indicate that waiting a few seconds before awarding the mark seems to result in more objective grading. This, in my view, is caused by the divided attention that comes from listening and assessing at the same time – I have experienced this first-hand many times!
This has important implications for teachers working with certain examination boards. The Cambridge International Examinations board, for instance, prescribes that, at IGCSE, the teacher/examiner award the mark instantaneously and explicitly forbids the practice of grading candidates retrospectively or after listening to the recording. If Hurman's findings hold true for the vast majority of examiners, examination boards like CIE may have to change their regulations and allow marking to be done retrospectively, so that the examiner's attention is not divided between listening to a candidate's response to a new question and still marking the previous one – an onerous task!
Issue n.2 – What does complexity of structures/language mean?
This is another crucial issue, one I encountered year in, year out when moderating GCSE/IGCSE candidates' oral performance during my teaching career. Teachers listening to the same recording usually tend to agree when it comes to complexity of vocabulary, but not necessarily when it comes to complexity of grammar/syntactic structures. Chambers and Richards' (1992) findings indicate that this is not simply my experience; their evidence suggests a high level of disagreement amongst the teachers involved in their study as to what constituted 'complexity of structures'. They also found that the teachers disagreed about what was meant by 'fluency' and 'use of idiom' – another issue that I have experienced myself when moderating.
To further complicate the picture, there is, in my view, another issue which research should probe and which I invite colleagues who work with teachers of different nationalities to investigate: namely, that raters who are native speakers of the target language tend to be stricter than raters who speak it as a second language. This issue is particularly serious in light of Issue n.5 below.
Issue n. 3 – Are the typical GCSE oral tasks ‘authentic’?
I often play a prank on my French colleague Ronan Jezequel by starting a conversation about the weekend just gone, asking questions in a GCSE-like style and sequence until, after a few seconds, he realizes that there is something wrong and looks at me funny… Are we testing our students on (and preparing them for) tasks that do not reflect authentic L2 native-speaker speech? This is what another study, by Chambers (1995), set out to investigate. The study examined 28 tapes of French GCSE candidates and compared them to conversations on the same themes by 25 French native speakers in the same age bracket. It found that not only did the native speakers, as might easily be predicted, use more words (437 vs 118) and more clauses (56.9 vs 23.9), but also that:
- The French native speakers found the topic of house/flat description socially unacceptable;
- The native speakers found the topic 'Daily routine' inauthentic and – interestingly – produced very few reflexive verbs;
- The native speakers used the near future whilst the non-natives used the simple future;
- The native speakers used the imperfect tense much more than the non-natives;
- The non-native speakers used relative clauses much less than the French.
Are these tests, as the researchers concluded, testing students’ ability to converse with native speakers or their acquisition of grammar?
Issue n.4 – The grammar accuracy bias
A number of studies (e.g. Alderson and Banerjee, 2002) have found time and again that perceptions of grammatical accuracy seem to bias examiners, regardless of how much weight the assessment specification gives to effective communication. This issue will be exacerbated or mitigated depending on the examiners' view of what linguistic proficiency means and on their degree of tolerance of errors: whereas one teacher might find a learner's communicatively effective use of compensation strategies (e.g. approximation or coinage) a positive thing even though it leads to grammatically flawed utterances, another might find it unacceptable.
Here again, background differences are at play. Mistakes that to a native speaker might appear stigmatizing or very serious might, to a non-native rater, seem mild or not even be considered mistakes at all…
Issue n.5 – Inter-rater reliability
This is the biggest problem of all, and it is related to Issue n.2 above: how reliable are the assessment procedures? Many years of research have shown that for any multi-trait assessment scale to be effective it needs to be extensively piloted. Moreover, whenever it is used for assessment, two or more co-raters must agree on the scores and, where there is disagreement, they must discuss the discrepancies until agreement is reached. However, when the internal moderator and the external one, in cases where the recording is sent to the examination board for assessment, do not agree… what happens to the discussion that is supposed to take place to reach a common agreement?
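To make the notion of inter-rater reliability a little more concrete, here is a minimal sketch, in Python, of how agreement between two raters can be quantified with Cohen's kappa. The band scores below are entirely hypothetical and do not come from any examination board; the point is simply that agreement can, and arguably should, be measured rather than assumed.

```python
from collections import Counter

# Hypothetical band scores (1-5) awarded by two raters to the same 10 candidates.
# These numbers are invented purely for illustration.
rater_a = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4]
rater_b = [5, 3, 4, 3, 4, 2, 4, 2, 5, 4]

def cohens_kappa(a, b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: probability that both raters award the same band by
    # chance, given each rater's own distribution of bands.
    expected = sum(freq_a[band] / n * freq_b[band] / n for band in set(a) | set(b))
    return (observed - expected) / (1 - expected)

observed = sum(x == y for x, y in zip(rater_a, rater_b)) / len(rater_a)
print(f"Observed agreement: {observed:.2f}")
print(f"Cohen's kappa:      {cohens_kappa(rater_a, rater_b):.2f}")
```

A kappa well below 1 is exactly the kind of discrepancy that, in principle, ought to trigger the discussion between internal and external moderators mentioned above.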
Another important issue relates to the multi-trait assessment scales used. First of all, they are too vague. This is convenient, because the vaguer they are the easier it is to 'fiddle' with the numbers. However, the vagueness of a scale makes it difficult to discriminate between performances when the variation in ability is not that great, as happens in a top-set class with A and A* students, for example. In these cases, in order to discriminate effectively between an 87% and a 90%, which could mean the difference between getting an A* or not, research clearly shows that the assessment scale used should contain more than the two or three traits (categories) usually found in GCSE scales (or even A Level ones, for that matter) and, more importantly, should be more fine-grained (i.e. each category should have more criterion-referenced grades). This would hold examination boards much more accountable, but would require more financial investment and work, I guess, on their part.
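By way of illustration only – the traits, mark allocations and candidate profiles below are hypothetical and not taken from any GCSE specification – this sketch shows how a coarse three-trait scale can fail to separate two near-identical top-set performances that a more fine-grained, six-trait scale does distinguish:

```python
# Hypothetical candidate profiles: a mark out of 10 for each of six aspects of
# the oral performance. Traits and numbers are invented purely for illustration.
candidate_1 = {"communication": 9, "range": 9, "accuracy": 8,
               "fluency": 9, "pronunciation": 8, "interaction": 9}
candidate_2 = {"communication": 9, "range": 9, "accuracy": 9,
               "fluency": 9, "pronunciation": 9, "interaction": 9}

def coarse_scale(profile):
    """Three broad traits, each collapsed into a whole band out of 5."""
    comm = int((profile["communication"] + profile["interaction"]) / 4)
    lang = int((profile["range"] + profile["accuracy"]) / 4)
    delivery = int((profile["fluency"] + profile["pronunciation"]) / 4)
    return (comm + lang + delivery) / 15 * 100

def fine_grained_scale(profile):
    """Six separate criterion-referenced traits, each kept out of 10."""
    return sum(profile.values()) / 60 * 100

for name, profile in [("Candidate 1", candidate_1), ("Candidate 2", candidate_2)]:
    print(f"{name}: coarse scale {coarse_scale(profile):.0f}%, "
          f"fine-grained scale {fine_grained_scale(profile):.0f}%")
```

On the coarse scale both candidates come out identical (80%); on the fine-grained one, the 87% vs 90% gap discussed above becomes visible.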