Why the reliability of UK Examination Boards’ assessment of A Level writing papers is questionable

August 26, 2015 | May 31, 2016 | Gianfranco Conti, PhD (Applied Linguistics), MA (TEFL), MA (English Lit.), PGCE (Modern Languages and P.E.) | Assessment, L2 writing, Uncategorized, Writing skills

The Language Gym

Often, Year 12 or Year 13 students who have consistently scored high in mock exams or other assessments of the writing component of the A Level exam paper do significantly less well in the actual exam. And when the teachers and/or students, in disbelief, apply for a re-mark, they often see the controversial original grade reconfirmed or, as has actually happened to two of my students in the past, even lowered. In the last two years, colleagues of mine around the world have seen this phenomenon worsen: are UK examination boards becoming harsher or stricter in their grading? Or is it that the essay papers are becoming more complicated? Or could it be that the current students are generally less able than the previous cohorts?

Although I do not discount any of the above hypotheses, I personally believe that the phenomenon is also partly due to serious issues…


Six very common flaws of foreign language assessment

August 19, 2015 | September 15, 2021 | Gianfranco Conti, PhD (Applied Linguistics), MA (TEFL), MA (English Lit.), PGCE (Modern Languages and P.E.) | Assessment, Uncategorized

Teaching – Assessment mismatch

Often foreign language instructors test students on competences that have not been adequately emphasized in their teaching or, in some cases, have not even been taught.

The most common example of this is task unfamiliarity, i.e. the use of an assessment tool or language task the students have never or rarely carried out prior to the test. This can be an issue, as research clearly shows that the extent to which a learner is familiar with a task will affect his/her performance. The reasons for this are the anxiety that the unfamiliarity engenders, the higher cognitive load that grappling with an unfamiliar task places on working memory (especially when the task is quite complex) and of course the fact that memory (and, consequently, task-knowledge) is context-dependent – which means that knowledge is not easily transferred from task to task (the so-called T.A.P. or Transfer Appropriate Processing principle). By doing a task over and over again prior to an assessment involving that task, the student develops task-related cognitive and metacognitive strategies which ease the cognitive load and facilitate its execution.

Another common scenario is when students are not explicitly directed to focus on, or given sufficient practice in, a given area of language proficiency (e.g. accuracy, fluency, vocabulary range, grammar complexity); yet their teachers use assessment scales which emphasize performance in that area (e.g. by giving grammatical accuracy a high weighting in speaking performance whilst practising grammar only through cloze tasks). I have had several colleagues in the past who taught their students through project-based work involving little speaking practice even though they knew that the students would be assessed in terms of fluency at the end of the unit. Bizarre!

Language unfamiliarity is another instance of this mismatch, in my opinion. This refers to administering a test which requires students to infer from context or even use unfamiliar words, and which therefore assesses the learners not on the language learnt during the unit but on compensation strategies (e.g. guessing words from context). Although compensation strategies are indeed a very important component of autonomous competence, I do believe that a test needs to assess students only on what they have been taught and not on their adaptive skills – or the assessment might be perceived by the learners as unfair, with negative consequences for student self-efficacy and motivation. A test must have construct validity, i.e. it must assess what it sets out to assess. Hence, unless we explicitly provide extensive practice in inferential skills, we should not test students on them.

Some teachers feel that, since the students should possess the knowledge of the language required by the task, whether or not they are familiar with the task will not matter; this assumption, however, is based on a misunderstanding of L2 acquisition and task-related proficiency.

Knowledge vs control

Very often teachers administer ‘grammar’ tests in order to ascertain whether a specific grammar structure has been ‘learnt’. This is often done through gap-fill/cloze tests or translations. This approach to grammar testing is correct if one is purporting to assess declarative (intellectual) knowledge of the target structure(s) but not the extent of the learners’ control over it (i.e. the ability to use grammar in real operating conditions, in relatively unmonitored speech or written output). An oral picture task or spontaneous conversational exchange eliciting the use of the target structure would be more accurate ways to assess the extent of learner control over grammar and vocabulary. This is another common instance of construct invalidity.

Listening vs Listenership

This refers less to a mistake in assessment design than to a pedagogical flaw and assessment deficit, and it is a very important issue because of its major washback effect on learning. Listening is usually assessed solely through listening comprehension tasks; however, this does not test an important set of listening skills, ‘listenership’, i.e. the ability to respond to an interlocutor (a speaker) in real conversation. If we only test students through comprehension tasks, the grade or level we assign to them will reflect one important set of listening skills (comprehending a text) but not the one they need the most in real-life interaction (listening to an interlocutor as part of meaning negotiation). Listening assessments need to address this important deficit, which, in my opinion, is widespread in the UK system.

Lack of piloting

To administer a test without piloting it can be very ‘tricky’ even if the test comes from a widely used textbook assessment pack. Ambiguous pictures and solutions, speed of delivery, inconsistent and/or very subjective grading of tests and construct validity issues are not uncommon flaws of many renowned course-books’ assessment materials. Ideally, tests should be piloted by more than one person on the team, especially when it comes to the grading system; in my experience this is usually the most controversial aspect of an assessment.

‘Woolly’ assessment scales

When you have a fairly homogenous student population, it is important to use assessment scales/rubrics which are as detailed as possible in terms of complexity, accuracy, fluency, communication and range of vocabulary. In this respect, the old UK National Curriculum Levels (still in use in many British schools) were highly defective and so are the GCSE scales adopted by UK examination boards. MFL departments should invest some quality time to come up with their own scales, making specific reference in the grade descriptors to the traits they emphasize the most in their curriculum (so as to satisfy the construct validity criterion).

Fluency – the neglected factor

Just like ‘listenership’, fluency is another factor of language performance that is often neglected in assessment; yet, it is the most important indicator of the level of control someone has achieved in TL receptive and productive skills. Whereas in speaking UK MFL departments do often include fluency amongst their assessment criteria, in writing and reading this is not often the case. Yet, it can be relatively easily done. For instance, in essay writing, all one has to do is to set a time and word limit for the task-in-hand and note down the time of completion for each student as they hand it in. Often teachers do not differentiate between students who score equally across accuracy, complexity and vocabulary but differ substantially in terms of writing fluency (i.e. the time to word ratio). By so doing we fail to assess one of the most important aspects of language acquisition: executive control over a skill. In my view, this is something that should not be overlooked, both in low-stake and high-stake assessments.

Crucial issues in the assessment of speaking and writing (Part 1)

August 10, 2015 | May 31, 2016 | Gianfranco Conti, PhD (Applied Linguistics), MA (TEFL), MA (English Lit.), PGCE (Modern Languages and P.E.) | Assessment, Uncategorized

In the last few weeks I have been thinking long and hard about the assessment of the productive skills (speaking and writing), dissatisfied as I am with the proficiency measurement schemes currently in use in many UK schools, which are either stuck in the former system (National Curriculum Levels) or strongly influenced by it (i.e. mainly tense-driven).

However, finding a more effective, valid and cost-effective alternative for use in British secondary schools is no easy task. The biggest obstacles I have found in the process are the following questions, which have been buzzing in my head for the last few weeks and whose answers are crucial to the development of any effective approach to the assessment of proficiency in the productive skills:

What is meant by ‘fluency’ and how do we measure it?
How do we measure accuracy?
What do we mean by ‘complexity’ of language? How can complexity be measured?
How do we assess vocabulary richness and/or range? How wide should L2 learner vocabulary range be at different proficiency stages?
What does it mean to ‘acquire’ a specific grammar structure or lexical item?
When can one say that a specific vocabulary and grammar item has been fully acquired?
What linguistic competences should teachers prioritize in the assessment of learner proficiency? Or should they all be weighted in the same way?
What task-types should be used to assess learners’ speaking and writing proficiency?
How often should we assess speaking and writing?
Should we assess autonomous learning strategy use? If so, how?

All of the above questions refer to constructs commonly used in the multi-trait scales usually adopted by researchers, language education providers and examination boards to assess L2 performance and proficiency. In this post, for reasons of space, I will only concern myself with the first three questions, leaving the rest for future posts. The issues they refer to are usually acronymized by scholars as CAF (Complexity, Accuracy, Fluency) but I find the acronym FAC (Fluency, Accuracy, Complexity) much more memorable… Thus I will deviate from mainstream Applied Linguistics on this account.

The issues

2.1 What do we mean by ‘fluency’ in speaking and writing? And how do we measure it?

2.1.1 Speaking

Fluency has been defined as ‘the production of language in real time without undue pausing or hesitation’ (Ellis and Barkhuizen 2005: 139) or, in the words of Lennon (1990), as ‘an impression on the listeners that the psycholinguistic process of speech planning and speech production are functioning easily and automatically’. Although many, including teachers, use the term ‘fluency’ as synonymous with competence in oral proficiency, researchers see it more as a temporal phenomenon (e.g. how effortlessly and ‘fast’ language is produced). In L2 research fluency is considered a different construct from comprehensibility, although from a teacher’s point of view it is obviously desirable that fluent speech be intelligible.

The complexity of the concept of ‘fluency’ stems mainly from its being a multidimensional construct. Fluency is in fact conceptualized as:

Break-down fluency – which relates to how often speakers pause;
Repair fluency – which relates to how often speakers repeat words and self-correct;
Speed fluency – which refers to the rate of speaker delivery.

Researchers have come up with various measures of fluency. The most commonly adopted are:

Speech rate: total number of syllables divided by total time taken to execute the oral task in hand;
Mean length of run: average number of syllables produced in utterances between short pauses;
Phonation/time ratio: time spent speaking divided by the total time taken to execute the oral task;
Articulation rate (rate of sound production): total number of syllables divided by the time taken to produce them (i.e. excluding pauses);
Average length of pauses.
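
The measures listed above are simple ratios, so they can be worked out from a timed transcript. The following is only a minimal sketch: it assumes a recording has already been segmented by hand into 'runs' (stretches of speech between pauses) with a syllable count and a duration for each, and that pause durations have been noted separately. The `Run` structure, the function name and the sample figures are all hypothetical and do not come from any of the studies cited.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One stretch of uninterrupted speech (hypothetical structure)."""
    syllables: int   # syllables uttered in this run
    seconds: float   # time spent producing them


def fluency_measures(runs, pauses):
    """Compute the five measures from runs and pause durations (seconds).

    Assumes at least one run; pauses may be an empty list.
    """
    total_syllables = sum(r.syllables for r in runs)
    speaking_time = sum(r.seconds for r in runs)
    total_time = speaking_time + sum(pauses)
    return {
        # syllables over the whole task time, pauses included
        "speech_rate": total_syllables / total_time,
        # average number of syllables per run
        "mean_length_of_run": total_syllables / len(runs),
        # share of the task time actually spent speaking
        "phonation_time_ratio": speaking_time / total_time,
        # syllables over speaking time only, pauses excluded
        "articulation_rate": total_syllables / speaking_time,
        "average_pause_length": sum(pauses) / len(pauses) if pauses else 0.0,
    }


# Invented sample: three runs separated by two one-second pauses
sample = fluency_measures(
    runs=[Run(12, 4.0), Run(8, 3.0), Run(10, 3.0)],
    pauses=[1.0, 1.0],
)
print(sample)
```

Note that speech rate and articulation rate differ only in whether pause time is counted in the denominator, which is why a learner who pauses a lot can have a low speech rate but a normal articulation rate.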

A seminal study by Towell et al. (1996) investigated university students of French. The subjects were tested at three points in time: (1) at the beginning of their first year; (2) in their second year; and (3) after returning from their year abroad (in France). The researchers found that improvements in fluency occurred mainly in terms of speech rate and mean length of run – the latter being the best indicator of development in fluency. Improvements in fluency were also evidenced by an increase in the rate of sound production (articulation rate), but not in a major way. In their investigation, Towell et al. found that assessing fluency based on pauses is not always a valid procedure because a learner might pause for any of the following reasons:

The demands posed by a specific task;
Difficulty in knowing what to say;
An individual’s personal characteristics;
Difficulty in putting into words an idea already in the brain;
Getting the right balance between length of utterance and the linguistic structure of the utterance.

Hence, the practice of rating students’ fluency based on pauses may not be as valid as many teachers often assume. As Lambert puts it: “although speed and pausing measures might provide an indication of automaticity and efficiency in the speech production process with respect to specific forms, their fluctuation is subject to too many variables to reflect development directly.”

2.1.2 Writing

When it comes to writing, fluency is much more difficult to define. As Bruton and Kirby (1987) observe,

Written fluency is not easily explained, apparently, even when researchers rely on simple, traditional measures such as composing rate. Yet, when any of these researchers referred to the term fluency, they did so as though the term were already widely understood and not in need of any further explication.

In reviewing the existing literature I was amazed by how much disagreement there is amongst researchers on how to assess writing fluency, which raises the question: if it is such a subjective construct on whose definition nobody agrees, how can the raters appointed by examination boards be relied on to do an objective job?

There are several approaches to assessing writing fluency. The most commonly used in research is composition rate, i.e. how many words are written per minute. So, for instance, in order to assess the development of fluency a teacher may give his/her class a prompt, stop them after a few minutes and ask the students, after giving guidelines on how to carry out the word count, to count the words in their output. This can be done at different moments in time, within a given unit of work or throughout the academic year, in order to map out the development of writing fluency.
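
As a rough illustration of composition rate, here is a minimal sketch. It assumes that a 'word' is simply any whitespace-separated token, which is only one possible word-count guideline and not one prescribed by the research; the function name and the two sample texts are invented.

```python
def composition_rate(text, minutes):
    """Words written per minute.

    Assumption: a word is any whitespace-separated token, so an elided
    form such as "J'habite" counts as one word.
    """
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return len(text.split()) / minutes


# Invented output from the same student, two minutes of writing each time
autumn_text = "J'habite à Londres avec ma famille"
spring_text = "J'habite à Londres avec ma famille et mon chien depuis cinq ans"

rates = {
    "autumn": composition_rate(autumn_text, 2),
    "spring": composition_rate(spring_text, 2),
}
print(rates)
```

Whatever counting rule a department adopts, it needs to stay the same across the year, otherwise the rates recorded at different points in time are not comparable.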

Initial implications

Oral fluency is a hugely important dimension of proficiency as it assesses the extent to which speaking skills have been automatized. A highly fluent learner is one who can speak spontaneously and effortlessly, with hardly any hesitation, backtracking and self-correcting.

Assessing fluency, as I have just discussed, is very problematic as there is no international consensus on what constitutes best practice. The Common European Framework of Reference for Languages, which is adopted by many academic and professional institutions around the world, provides some useful – but not flawless – guidelines (http://www.coe.int/t/dg4/education/elp/elp-reg/Source/Key_reference/Overview_CEFRscales_EN.pdf). MFL departments could adapt them to suit their learning context, mindful of the main points put across in the previous paragraphs.

The most important implications for teachers are:

Although we do not have to be as rigorous and pedantic as researchers, we may want to be mindful in assessing our students’ fluency of the finding (confirmed by several studies) that more fluent speakers produce longer utterances between short pauses (mean length of run);
However, we should also be mindful of Towell et al.’s (1996) finding that there may be individuals who pause because of other issues not related to fluency but rather to anxiety, working memory issues or other personal traits. It is important in this respect to get to know our students and make sure that we have repeated oral interactions with them so as to get better acquainted with their modus operandi during oral tasks;
In the absence of international consensus on how fluency should be measured, MFL departments may want to decide whether and to what extent frequency of self-repair, pauses and speed should be used in the assessment of their learners’ fluency;
If the GCSE or A level examination adopted by their school does include degrees of fluency as an evaluative criterion – as Edexcel’s, for instance, does – then it is imperative for teachers to ask which operationalization of fluency is applied in the evaluation of candidates’ output so as to train students accordingly in preparation for the oral and written exams;
Although comprehensibility is a separate construct from fluency in research, teachers will want their students to speak and write at a speed as close as possible to native speakers’ but also to produce intelligible language. Hence, assessment criteria should combine both constructs;
Mini-assessments of writing fluency of the kind outlined above (the teacher giving a prompt and students having to write under timed conditions) should be conducted regularly, two or three times a term, to map out students’ progress whilst training them to produce language in real operating conditions. If this kind of assessment starts at KS3 or even KS2 (with able groups and ‘easier’ topics), by GCSE and A-levels it may have a positive washback effect on learner examination performance.

3. Accuracy

Intuitively, accuracy would seem the easiest way to assess language proficiency, but it is not necessarily so. Two common approaches to measuring accuracy involve: (1) calculating the ratio of errors in a text/discourse to the number of units of production (e.g. words, clauses, sentences, T-units) or (2) working out the proportion of error-free units of production. This is not without problems because it does not tell us much about the type of errors made, which may be crucial in determining the proficiency development of a learner. Imagine Learner 1, who has made ten errors with very advanced structures, and Learner 2, who has made ten errors with very basic structures without attempting any of the advanced structures Learner 1 has made mistakes with. To evaluate these two learners’ levels of accuracy as equivalent would be unfair.
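
The two approaches just described, and the limitation they share, can be made concrete with a minimal sketch. It assumes that a rater has already counted the errors in each unit of production (clauses, say); the function name and the error counts are invented for illustration.

```python
def accuracy_measures(errors_per_unit):
    """Two common accuracy scores from a list of per-unit error counts.

    errors_per_unit[i] is the number of errors a rater found in unit i
    (hypothetical input; the unit could be a clause, sentence or T-unit).
    """
    units = len(errors_per_unit)
    if units == 0:
        raise ValueError("at least one unit of production is required")
    total_errors = sum(errors_per_unit)
    error_free_units = sum(1 for count in errors_per_unit if count == 0)
    return {
        "errors_per_unit": total_errors / units,          # approach (1)
        "error_free_ratio": error_free_units / units,     # approach (2)
    }


# Learner 1: ten errors, all made while attempting advanced structures.
learner_1 = accuracy_measures([0, 2, 1, 0, 3, 0, 2, 0, 2, 0])
# Learner 2: ten errors, all made with basic structures.
learner_2 = accuracy_measures([2, 0, 0, 3, 0, 1, 2, 0, 2, 0])

print(learner_1, learner_2)
```

Both learners come out with identical scores because neither measure records which structures the errors occurred in, which is exactly the unfairness described above.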

Moreover, this system may penalize learners who take a lot of risks in their output with highly challenging structures. So, for instance, an advanced student who tries out a lot of difficult structures (e.g. if-clauses, subjunctives or complex verbal subordination) may score less than someone of equivalent proficiency who ‘plays it safe’ and avoids taking risks. Would that be a fair way of assessing task performance/proficiency? Also, pedagogically speaking, this approach would be counter-productive, as it would encourage avoidance behavior rather than risk-taking, possibly the most powerful learning strategy ever.

Some scholars propose that errors should be graded in terms of gravity. So, errors that impede comprehension should be considered as more serious than errors which do not. But in terms of accuracy, errors are errors, regardless of their nature. We are dealing with two different constructs here, comprehensibility of output and accuracy of output.

Another problem with using accuracy as a measure of proficiency development is that learner output is compared with native-like norms. However, this does not tell us much about the learner’s interlanguage development; only with what degree of accuracy she/he handles specific language items.

Lambert (2014) reports another important issue pointed out by Bard et al. (1996):

In making grammaticality judgments, raters do not only respond to the grammaticality of sentences, but to other factors which include the estimated frequency with which the structure has been heard, the degree to which an utterance conforms to a prescriptive norm, and the degree to which the structure makes sense to the rater semantically or pragmatically. Such acceptability factors are difficult to separate from grammaticality even for experienced raters.

I am not ashamed to say that I have experienced this myself on several occasions as a rater of GCSE Italian oral exams. And to this day, I find it difficult not to let these three sources of bias skew my judgment.

3.1 Initial implications for teachers and assessment

Grammatical, lexical, phonological and orthographic accuracy are important aspects of proficiency included in all the examination assessment scales. MFL departments ought to decide collegially whether accuracy should play an equally important or a more or less important role in assessment than fluency/intelligibility and communication.

Also, once they have decided which of the structures the curriculum purports to teach for productive use are more complex and which are easier, teachers may want to focus in assessment mostly or solely on the accuracy of those structures – as this may have a positive washback effect on learning.

MFL teams may also want to discuss to what extent one should assess accuracy in terms of number or types of mistakes or both, and whether mistakes with normally late-acquired, more complex structures should be penalized, considering that such an assessment approach might encourage avoidance behavior.

4. Complexity

Complexity is the most difficult construct to define and use to assess proficiency because it can refer to different aspects of performance and communication (e.g. lexical, interactional, grammatical, syntactic). For instance, are lexical and syntactic complexity two different aspects of the same performance or two different areas altogether? Some researchers (e.g. Skehan) consider them separate and I tend to agree. So, how should a student’s oral or written performance exhibiting a complex use of vocabulary but a not so complex use of grammar structures or syntax be rated? Should evaluative scales then include two complexity traits, one for vocabulary and one for grammar/syntax? I think so.

Another problem pertains to what we take ‘complex’ to actually mean. Does complex mean…

‘the number of criteria to be applied in order to arrive at the correct form’, as Hulstijn and De Graaff (1994) posit? In other words, how many steps the application of the underlying rule involves? (e.g. the perfect tense in French or Italian with verbs requiring the auxiliary ‘to be’)
variety? Meaning, that, in the presence of various alternatives, choosing the appropriate one flexibly and accurately across different contexts would be an index of high proficiency? (this is especially the case with lexis)
cognitively demanding, challenging? Or
acquired late in the acquisition process? (which is not always easy to determine)

All of the above dimensions of complexity pose serious challenges in their conceptualization and objective application to proficiency measurement.

Standard ways of operationalizing language complexity in L2 research have also focused on syntactic complexity, and especially on verbal subordination. In other words, researchers have analyzed L2 learner output by dividing the total number of finite and non-finite clauses by sentential units of analysis such as terminal units, communication units, speech, etc. One of the problems with this is that the number thus obtained is just a figure that tells us that one learner has used more verbal subordination than another but does not differentiate between types of subordination – so, if a learner uses fewer but more complex subordinate clauses than another, s/he will still be rated as using less complex language.
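
The subordination ratio described above boils down to one division, as the following minimal sketch shows. It assumes a rater has already segmented the output into units of analysis (T-units, say) and counted the clauses in each; the function name and the clause counts are hypothetical.

```python
def clauses_per_unit(clause_counts):
    """Total clauses divided by the number of units of analysis.

    clause_counts[i] is the number of finite and non-finite clauses a
    rater identified in unit i (hypothetical, hand-coded input).
    """
    if not clause_counts:
        raise ValueError("at least one unit of analysis is required")
    return sum(clause_counts) / len(clause_counts)


# Learner A: frequent but simple subordination (e.g. 'because' clauses).
learner_a = clauses_per_unit([2, 3, 1, 2])
# Learner B: less frequent but more sophisticated subordination.
learner_b = clauses_per_unit([1, 2, 1, 2])

print(learner_a, learner_b)
```

Learner B scores lower even though his/her subordinate clauses may be harder to produce: the ratio counts clauses but carries no information about their type.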

4.1 Implications for teachers

Complexity is a very desirable quality of learner output and a marker of progress in proficiency, especially when it goes hand in hand with high levels of fluency. However, in the absence of consensus as to what is complex and what is not, MFL departments may want to decide collegially on the criteria amongst the ones suggested above (e.g. variety, cognitive challenge, number of steps required to arrive at the correct form and lateness of acquisition) which they find most suitable for their learning contexts and curricular goals and constraints.

Also, they may want to consider splitting this construct into two strands, vocabulary complexity and grammatical complexity.

Finally, verbal subordination should be considered as a marker of complexity and emphasized with our learners. However, especially with more advanced learners (e.g. AS and A2) it may be useful to agree on what constitutes more advanced and less advanced subordination.

In addition, since complexity of language does appear as an evaluative criterion in A-level examination assessment scales, teachers may want to query with the examination boards what complexity stands for and demand a detailed list of which grammar structures are considered as more or less complex.

Conclusions

Fluency, Accuracy and Complexity are very important constructs central to all approaches to the assessment of the two productive macro-skills, speaking and writing. In the absence of international consensus on how to define and measure them, MFL departments must come together and discuss assessment philosophies, procedures and strategies to ensure that learner proficiency evaluation is as fair and valid as possible and matches the learning context they operate in. In taking such decisions, the washback effect on learning has to be considered.

As I have only dealt with three of the ten issues outlined at the beginning of this post, the picture is far from complete. What is clear is that there are no clear norms as yet, unless one decides to adopt in toto an existing assessment framework such as the CEFR’s (http://www.coe.int/t/dg4/education/elp/elp-reg/Source/Key_reference/Overview_CEFRscales_EN.pdf). This means that MFL departments have the opportunity to make their own norms based on an informed understanding – to which I hope this post has contributed – of the FAC constructs and of the other crucial dimensions of L2 performance and proficiency assessment that I will deal with in future posts.

Of the ‘curse’ of tense-driven progression in MFL learning

July 16, 2015 | May 31, 2016 | Gianfranco Conti, PhD (Applied Linguistics), MA (TEFL), MA (English Lit.), PGCE (Modern Languages and P.E.) | Assessment, General pedagogy, Grammar teaching, Uncategorized

For too many years the UK National Curriculum posited the ‘mastery’ of tenses as the main criterion for progression along the MFL proficiency continuum. A learner would be on Level 4 if s/he mastered one tense + opinions, on Level 5 if s/he mastered two, etc. This preposterous approach to the benchmarking of language proficiency has always baffled me and has caused enormous damage to MFL education in the UK for nearly two decades. Not surprisingly, I felt relieved when the current British government ‘scrapped’ the National Curriculum Levels. Sadly, this approach to progression is so embedded in much UK teaching, curriculum design and practice that it will be very difficult to uproot, especially considering that some examination boards still place too much emphasis on tenses in their assessment of GCSE examination performance.

But why am I so opposed to tense-driven progression? There are two main reasons. First and foremost, the expressive power of a speaker/writer in any language is not a function of how many tenses s/he masters; it is more a function of – in no particular priority order:

How much vocabulary (especially verbs, nouns and adjectives) s/he has acquired;
How flexibly s/he can apply that vocabulary across contexts;
How intelligible his/her output is;
How effectively s/he can use time markers (which will clearly signpost the time dimension we are referring to in communication);
How effectively s/he masters the various functions of discourse (agreeing, disagreeing, evaluating, etc.) which will hinge on his/her knowledge of discourse markers (however, moreover, etc.) and subordination;
How effectively s/he masters L2 syntax; etc.

In fact, in several world languages tenses do not really exist. In Bahasa Malaysia, for instance, one of the official languages of the beautiful country I live in, tenses – strictly speaking – do not exist. The past, the present and the future are denoted by time adverbials, e.g. one would say ‘Yesterday I leave my wallet in the hotel room’. A sentence like this one conveys more meaning than the more accurate ‘I left my wallet in the hotel room’: it is perfectly intelligible and more useful if one needs to tell the owner of the hotel one stayed in last week when the wallet was left behind. Yet, according to the former National Curriculum Levels the second sentence would be a marker of higher proficiency…

Placing so much emphasis on the uptake of tenses skews the learning process by channeling teachers’ and students’ efforts away from other equally or even more important morphemes and aspects of the language, which end up being neglected and receiving little emphasis in the classroom and textbooks. It also creates misleading beliefs in learners about what they should prioritize in their learning.

This is one of the main problems with tense-driven progression, but not the main one. The most problematic issue refers to the pressure that it puts on teachers and learners to acquire as many tenses as possible in the three KS3 years. This is what, in my view, has greatly damaged British MFL education in the last 20 years, since the UK National Curriculum Levels were implemented. Besides resulting in overemphasizing tense teaching, such pressure has two other very negative outcomes.

Firstly, many teachers end up neglecting the most important dimension of learning – Cognitive Control. This occurs due to the fact that not enough time is devoted to practising each target tense; hence MFL students often learn the rules governing the tenses but cannot use them flexibly, speedily and accurately under communicative and/or time pressure. The pressure to move up one notch, from a lower level to a higher level – often within the same lesson – reduces the opportunities for practice that students ‘badly’ require to consolidate the target material, unduly increasing cognitive overload.

Secondly, students are often explicitly encouraged, or choose, to memorize model sentences which they embed in their speech or writing pieces in order to achieve a higher grade, learning them ‘ad hoc’ for a scheduled assessment. This would be acceptable if it led to acquisition or if it were supported by a grasp of the tenses; but this is not always the case.

In conclusion, I advocate that the benchmarking criteria that UK teachers adopt explicitly or implicitly, consciously or subconsciously to assess progression in MFL learning should be based on a more balanced approach to the measurement of proficiency; one which emphasizes discourse functions, range of vocabulary (especially mastery of verbs and adjectives) and pronunciation, much more than it currently – a year since the National Curriculum Levels were abolished – still does. As I have often reiterated in my posts, teaching should concern itself above all with acquisition of cognitive control rather than with the learning of mere rule knowledge. Progression should be measured more in terms of speed and accuracy of execution under real-life-like communicative pressure, width of vocabulary, functions and structures mastered as well as syntactic complexity. Tenses are important, of course, but they should not take priority over discourse features which are more crucial to effective communication.

Five important flaws of GCSE oral tests

June 8, 2015 / May 31, 2016 Gianfranco Conti, Phd (Applied Linguistics), MA (TEFL), MA (English Lit.), PGCE (Modern Languages and P.E.) Assessment, Uncategorized

Research has highlighted a number of issues with oral testing which examination boards and teachers need to heed, as they can have important implications not just for the way GCSE syllabi are designed, but also for the conduct and the assessment of oral GCSE exams as well as for our teaching. The issues which I will highlight are, I suspect, generalizable to other types of oral tests conducted in other educational systems, especially when it comes to the reliability of the assessment procedures and the authenticity of the tasks adopted. They are important issues, as they bring into question the fairness and objectivity of the tests as well as whether we are truly preparing children for real, native-like L2 communication.

Issue n.1 – Instant or delayed assessment?

A study by Hurman (1996), cited in Macaro (2007), investigated how examiners assessed the content and accuracy of candidates’ responses in GCSE role-play tests. Hurman took 60 experienced examiners and divided them into two groups: one spent some time before awarding the mark and the other did it instantaneously. Hurman’s findings indicate that waiting a few seconds before awarding the mark seems to result in more objective grading. In my view, this is because listening while focusing on assessment divides the examiner’s attention; I have experienced this first-hand many times!

This has important implications for teachers working with certain examination boards. The Cambridge International Examinations board, for instance, prescribes that, at IGCSE, the teacher/examiner award the mark instantaneously and explicitly forbids the practice of grading the candidates retrospectively or after listening to the recording. If Hurman’s findings were to hold true for the vast majority of examiners, examination boards like CIE might have to change their regulations and allow marking to be done retrospectively, so that the examiner’s attention is not divided between listening to the candidate’s response to a new question and still marking the previous one, which is an onerous task!

Issue n.2 – What does complexity of structures/language mean?

This is another crucial issue, which I encountered year in, year out when moderating GCSE/IGCSE candidates’ oral performance during my teaching career. Teachers listening to the same recording usually tend to agree when it comes to complexity of vocabulary but not necessarily when it comes to complexity of grammar/syntactic structures. Chambers and Richards’ (1992) findings indicate that this is not simply my experience; their evidence suggests that there was a high level of disagreement amongst the teachers involved in their study as to what constituted ‘complexity of structures’. They also found that the teachers disagreed about what was meant by ‘fluency’ and ‘use of idiom’, another issue that I have experienced myself when moderating.

To further complicate the picture, there is, in my view, another issue which research should probe into, and which I invite colleagues who work with teachers of different nationalities to investigate: namely, that raters who are L1 speakers of the target language tend to be stricter than raters who are L2 speakers of it. This issue is particularly serious in light of Issue n.5 below.

Issue n.3 – Are the typical GCSE oral tasks ‘authentic’?

I often play a prank on my French colleague Ronan Jezequel by starting a conversation about the weekend just gone by, asking questions in a GCSE-like style and sequence until, after a few seconds, he realizes that there is something wrong and looks at me funny… Are we testing our students on (and preparing them for) tasks that do not reflect authentic L2 native speaker speech? This is what another study, by Chambers (1995), set out to investigate. The study examined 28 tapes of French GCSE candidates and compared them to conversations on the same themes by 25 French native speakers in the same age bracket. It found not only that, as one would predict, the native speakers used more words (437 vs 118) and more clauses (56.9 vs 23.9), but also that:

The French speakers found the topic house/flat description socially unacceptable;
The native speakers found the topic ‘Daily routine’ unauthentic and, interestingly, produced very few reflexive verbs;
The native speakers used the near future whilst the non-natives used the simple future;
The native speakers used the imperfect tense much more than the non-natives;
The non-native speakers used relative clauses much less than the French.

Are these tests, as the researchers concluded, testing students’ ability to converse with native speakers or their acquisition of grammar?

Issue n.4 – The grammar accuracy bias

A number of studies (e.g. Alderson and Banerjee, 2002) have found time and again that assessors’ perception of grammatical accuracy seems to bias their grading, regardless of how much weight the assessment specification gives to effective communication. This issue will be exacerbated or mitigated depending on the examiners’ view of what linguistic proficiency means and on their degree of tolerance of errors; whereas one teacher might find a learner’s communicatively effective use of compensation strategies (e.g. approximation or coinage) a positive thing even though it leads to grammatically flawed utterances, another might find it unacceptable.

Here again, background differences are at play. Mistakes that to a native speaker might appear stigmatizing or very serious might seem mild to a non-native speaker, or not even be considered mistakes at all…

Issue n.5 – Inter-rater reliability

This is the biggest problem of all and it is related to Issue n.2 above: how reliable are the assessment procedures? Many years of research have shown that for any multi-trait assessment scale to be effective it needs to be extensively piloted. Moreover, whenever it is used for assessment, two or more co-raters must agree on the scores and, where there is disagreement, they must discuss the discrepancies until agreement is reached. However, when the internal moderator and the external one do not agree, in cases where the recording is sent to the examination board for assessment, what happens to the discussion that is supposed to take place to reach a common agreement?

Another important issue relates to the multi-trait assessment scales used. First of all, they are too vague. This is convenient, because the vaguer they are, the easier it is to ‘fiddle’ with the numbers. However, the vagueness of a scale makes it difficult to discriminate between performances when the variation in ability is not that great, as happens in a top-set class, for example, with A and A* students. In these cases, in order to discriminate effectively between an 87% and a 90%, which could mean getting an A* or not, research shows clearly that the assessment scale used should contain more than the two or three traits (categories) usually found in GCSE scales (or even A Level ones, for that matter) and, more importantly, should be more fine-grained (i.e. each category should have more criterion-referenced grades). This would hold examination boards much more accountable, but would require more financial investment and work, I guess, on their part.