On 8 common pitfalls of grammar assessment and the art and science of keeping it real

A few days ago I came across a tweet in which a hard-working and very capable language educator I know, an experienced Head of Department at one of the NCELP hub schools, claimed that her students had shown evidence, in a recently administered test, of impressive progress in the learning of a grammar structure. The author of the tweet, henceforth referred to as ‘Teacher X’, attached a snapshot of the test, designed to assess the learning of a fairly complex Spanish structure: the use of the indirect pronoun with verbs like ‘gustar’.

From what I could glean from the picture shared in the tweet, the test appeared to consist of at least three parts:

1. A task whereby the students were required to choose which of two options was the grammatically correct one;

2. A grammaticality judgment task whereby the students were to determine the correctness of a set of sentences;

3. A task which included a mix of L1-to-L2 and L2-to-L1 translation: half the sentences were to be translated from Spanish to English and the other half from English to Spanish. Each sentence contained an instance of the target grammar structure.

In this post I intend to show how the test, the interpretation of its results and the claims made by Teacher X about its outcomes exemplify some common pitfalls of grammar assessment which undermine the reliability and validity of testing practices in many school settings. Yet the tweet was retweeted by the official NCELP Twitter account, which meant that the National Centre for Excellence for Language Pedagogy endorsed the content of the tweet and tacitly approved of Teacher X’s testing practices and claims.

Let us have a look at the issues with that assessment that one can easily identify at a glance.

Pitfall 1: ‘either…or…’ grammar assessment tasks

Assessment tasks in which the students have a 50/50 chance of getting each answer right through random guessing are evidently unreliable tests of grammar competence. It may surprise you that an NCELP hub school would use a test so obviously invalid – after all, the ‘E’ in NCELP stands for ‘excellence’. But actually, you shouldn’t be surprised: the director of NCELP herself, Emma Marsden, used exactly the same type of task in a peer-reviewed study (Kasprowicz & Marsden, 2017) which she often cites as evidence of the success of grammar instruction. Professor Frank Boers, in his excellent 2021 book, reviews this study, criticising Kasprowicz and Marsden’s testing approach and describing the results obtained by Marsden and her co-author (also part of the NCELP team) as ‘disappointing’ (see figure 1 below, which summarises Boers’ (2021) criticism).

Figure 1 – A summary of the points Boers (2021) makes with regard to Kasprowicz and Marsden’s (2017) study with young learners of German. The test format entailed a 50% chance of guessing correctly.
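To put a rough number on the guessing problem, here is a minimal Python sketch. It assumes each two-option item is an independent blind guess (a simple binomial model) and uses a hypothetical 10-item paper; the function name `prob_at_least` and the paper length are my own illustrative choices, not details of Teacher X's test or of the study discussed above.

```python
from math import comb


def prob_at_least(n_items: int, min_correct: int, p_guess: float = 0.5) -> float:
    """Probability of at least `min_correct` right answers from blind guessing.

    Each item is treated as an independent guess with success chance
    `p_guess` (0.5 for a two-option task), i.e. a binomial model.
    """
    return sum(
        comb(n_items, k) * p_guess**k * (1 - p_guess) ** (n_items - k)
        for k in range(min_correct, n_items + 1)
    )


# Hypothetical paper length: an assumption for illustration only.
n_items = 10
for min_correct in (5, 6, 7, 8):
    chance = prob_at_least(n_items, min_correct)
    print(f"{min_correct}/{n_items} or better by guessing alone: {chance:.1%}")
```

Under these assumptions, a student who knows nothing at all has roughly a 62% chance of scoring 5 or more out of 10 and about a 17% chance of scoring 7 or more, so a respectable-looking class average on such a task tells us very little about actual grammar competence.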

Pitfall 2 – A task or tasks within the same test paper providing cues to the students on how to execute other tasks

The test at hand contained, within the same translation task, alternating L2 and L1 sentences (in which the target structure was task-essential) to be translated into the L1 and the L2 respectively. This too undermines the reliability of the test, as the students can of course use the L2 sentences as reminders of the target grammar rule(s) or, should they have forgotten the rule(s), even as worked examples from which to infer how to go about translating from the L1 to the L2. For example, suppose I am testing students on the French perfect tense and ask them to translate sentence (i) below into English:

     (i) J’ai mangé de la viande

And then ask the students to translate into French sentence (ii) below:

     (ii) I ate some chicken

A student who can translate the first sentence correctly into English but is not 100% sure of how to translate the second one into French can easily ‘cheat’ by copying the first portion of sentence (i).

Pitfall 3 – Scoring translation tasks holistically to assess the learning of a grammar structure

If the translation of a sentence is used as a means to assess grammar, how is the translation of the portion of that sentence which doesn’t contain the target structure scored? In other words, if the to-be-translated sentence reads

They don’t like reading fashion magazines because it’s boring

and the target structure is the use of the indirect pronoun + gustar in Spanish, what happens if a student gets ‘No les gusta’ right but gets everything else wrong? Should they be penalised? By rights, if it is a grammar test aimed exclusively at ascertaining the extent to which students have mastered the use of indirect pronouns + gustar, the student should score full marks for that sentence. No? The test at hand appeared to grade the sentences in terms of accuracy across the board, including the vocabulary and the other structures embedded in the sentences. Hence, someone who gets the target structure wrong but translates the rest of the sentence correctly may obtain a higher mark than someone who gets the target structure right but gets the rest of the sentence wrong. With this in mind, the test score is unlikely to provide a valid assessment of the learners’ mastery of the specific structure the assessment was designed to target.
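The scoring problem can be made concrete with a small Python sketch. Everything in it is hypothetical: the division of the example sentence into four equally weighted chunks, the names `Chunk`, `holistic_score` and `target_only_score`, and the two imaginary students. It is not Teacher X's actual mark scheme, only an illustration of how the two scoring approaches can rank the same students in opposite order.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str        # English chunk of the sentence to be translated
    is_target: bool  # does translating it require the target structure?
    correct: bool    # did the student translate this chunk correctly?


def holistic_score(chunks: list[Chunk]) -> float:
    """Share of ALL chunks translated correctly (accuracy across the board)."""
    return sum(c.correct for c in chunks) / len(chunks)


def target_only_score(chunks: list[Chunk]) -> float:
    """Share of target-structure chunks translated correctly."""
    targets = [c for c in chunks if c.is_target]
    return sum(c.correct for c in targets) / len(targets)


def mark_sentence(target_ok: bool, rest_ok: bool) -> list[Chunk]:
    # Hypothetical segmentation of the example sentence, equal weight per chunk.
    return [
        Chunk("They don't like", True, target_ok),
        Chunk("reading", False, rest_ok),
        Chunk("fashion magazines", False, rest_ok),
        Chunk("because it's boring", False, rest_ok),
    ]


student_a = mark_sentence(target_ok=False, rest_ok=True)  # misses the target only
student_b = mark_sentence(target_ok=True, rest_ok=False)  # gets only the target right

for name, chunks in (("A", student_a), ("B", student_b)):
    print(name, holistic_score(chunks), target_only_score(chunks))
```

With these made-up marks, Student A scores 0.75 holistically but 0.0 on the target structure, while Student B scores 0.25 holistically but 1.0 on the target structure: the holistic mark rewards exactly the student who has not learnt what the test claims to measure.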

Pitfall 4 – Lack of authenticity and transferability of knowledge

One of the five principles of effective assessment (see figure 2 below) advocated by the most renowned scholars in the field of L2 assessment is authenticity, i.e. the tasks included in the test should mirror or at least approximate real-life tasks (Brown, 2004; Purpura, 2006; 2011).

Figure 2: The five effective-assessment principles on which the most eminent L2 assessment specialists worldwide universally agree.

Why is authenticity – or at least an approximation of authenticity – so important? The answer lies in the phenomenon of transfer-appropriate processing (TAP), which holds that knowledge is context-specific: retrieval is more likely to be effective when the conditions at retrieval are similar to the conditions at learning. Hence, for instance, if I practise using locative adverbs/adverbials in French only or mainly through gap-fill or grammaticality-judgment tasks (e.g. is this sentence correct or incorrect?), I will be unlikely to use them effectively in a conversation with a French speaker about where the places I want to see are located. On the other hand, if I practise the deployment of locative adverbs/adverbials in the context of role-plays, I might be able to transfer that knowledge to a real-life interaction in which I ask for directions, for instance.

Another dimension of transferability is the modality-specificity of L2 competence. In other words, what I learn through one skill (e.g. writing) is unlikely to transfer easily to another (e.g. speaking). Hence, even though I may write fluently in the perfect tense in French, I may not be able to use it fluently in speech. The obvious implication of this is that grammar teaching and assessment must be multi-modal.

Now, if we evaluate the NCELP’s grammar revision, homework and assessment tasks in the light of TAP, we can easily conclude that they are not fit for purpose: they lack authenticity, they don’t typically practise/test grammar knowledge across all four language skills, and they do not include task-essential communicative tasks which develop/assess spontaneous deployment of the target L2 structures.

Figure 3: Transfer Appropriate Processing is at play when we attempt to transfer any knowledge acquired through one context/task to another. It is very much like training a puppy to perform a trick at home only to find out that it can’t perform it at the park (because the surrounding environment has changed).

Pitfall 5 – Grammar-knowledge-only assessments

Purpura (2006) makes a distinction between grammar knowledge and grammar ability which mirrors Larsen-Freeman’s distinction between grammar and grammaring and Krashen’s famous dichotomy of learning versus acquisition. Grammar knowledge refers to declarative knowledge, i.e. the conscious application of grammar rules; grammar ability, instead, refers to the ability to apply grammar rules in fluent spoken production, in other words, procedural knowledge.

In real-life oral interaction, the usefulness of grammar knowledge accrued through grammaticality-judgment tasks (e.g. ‘Correct or Incorrect?’), gap-fills, ‘either…or…’ tasks, etc. is very limited, not only because these tasks flout the ‘authenticity’ principle, but also because, as Willem Levelt’s model of word production (the most widely accepted to date) posits, in order for grammar retrieval to be useful in fluent spoken production, it must occur in a split second (see figure 4 below).

Figure 4: Willem Levelt’s model of word production. When we retrieve a word, the brain first activates its meaning, then its grammar and syntax. In oral production, this process happens in a few hundred milliseconds, which means that grammatical knowledge must be accessed very fast.

So, if we accept the account of skill acquisition provided by skill theory (e.g. Anderson, 1980) and espoused by the NCELP, Teacher X’s test evidences – at best – that her students are at the beginning of the skill-acquisition curve, i.e. at the awareness stage (see figure 5 below). In other words, the claim by Teacher X that her students had learnt the target structure should be substantially scaled down, or the term ‘learnt’ clarified: what does she mean by it? Her students still have many months or even years to go before they are able to deploy the indirect pronoun + gustar construction in fluent speech. Let me remark, incidentally, that no NCELP assessment, at least to my knowledge, tests learners’ spontaneous use of the target grammatical structures, even though NCELP claims on its website and in some of its CPD resources that teaching should aim at automatising knowledge. So it isn’t clear how spontaneity in the use of any of the target structures in their schemes of learning is going to be achieved.

Figure 5: The key stages in the acquisition of a grammar structure according to skill theory. Attaining fluency in the spontaneous deployment of a grammar structure is a very lengthy process which might take several years.

Pitfall 6: The natural sequences of acquisition

Another important issue further exacerbates the problems discussed under Pitfall 5 above: the target structure in Teacher X’s test paper is beyond the current developmental reach of her students (year 8 – UK system). In fact, the use of the indirect pronoun + gustar (and similar verbs) in Spanish emerges quite late in L2-Spanish learners’ spontaneous output (in other words, it is acquired late in the acquisition process). Hence, whilst one can test one’s students’ grasp of the grammar rule, one cannot, by any stretch of the imagination, claim at such an early stage in the L2 learning journey that the students will actually acquire it any time soon.

Processability theory posits that there are fixed developmental sequences in the acquisition of a second language which grammar instruction cannot circumvent but may be able to accelerate (Pienemann, 1998). As can be gleaned from the slide in figure 6, the structure Teacher X tested her beginner learners on entails procedure 5 (sentence procedure), which cannot be acquired by a typical beginner learner as it requires the mastery of procedures 1, 2, 3 and 4, which are never fully mastered at this level.

Figure 6: Manfred Pienemann’s developmental sequences in L2 acquisition. The theory, which has been proven right by a large number of studies, states that you cannot move to a more advanced procedure unless you have a fairly high degree of mastery in the preceding ones.

Pitfall 7 – Highly telegraphed tests: high retrieval strength and the illusion of mastery

Usually a class sits a grammar assessment at the end of a series of lessons on a specific structure (e.g. forming the perfect tense with être) or set of structures (perfect tense usage and formation in French as a whole). This means that retrieval strength for that given structure is likely to be high. Why? Because the teacher will have firmly kept the target structure in the students’ focal attention by practising it lesson in, lesson out for a few weeks and by providing corrective feedback on its deployment in oral and written work. So, when the test on that grammar structure is administered, the students know exactly what is expected of them. This state of affairs is of course exacerbated when the students are told explicitly that the test is going to be on that particular grammar structure: retrieval strength will be even higher in this case. In such testing conditions, a good proportion of the students are likely to do fairly well, thereby giving the teacher the impression that the students have now mastered the target structure. This is exactly what Teacher X was claiming in her tweet.

Now, imagine testing the same students on that very same grammar structure 4-5 weeks down the line without any prior revision and without ‘telegraphing’ the test. Will they do as well? The answer is: unlikely. Plenty of studies show that not only will the learners be unlikely to use it spontaneously in production and transfer it to unfamiliar tasks, but many of them will also have forgotten how to use it, especially if there are major cross-linguistic L1-L2 differences in the usage of the target structure (negative transfer). The main reasons for forgetting are, of course, (1) memory decay, (2) proactive/retroactive interference and (3) cue-dependent forgetting.

Pitfall 8: Are we testing grammar-rule application or the retrieval of memorised exemplars?

When one examines Teacher X’s test, it is obvious that the sentences used to assess the students on the target grammar structure had been used several times in the lessons prior to the assessment. How do we know that? Because those sentences or very similar ones do occur multiple times in the NCELP’s resources on that grammar structure. Hence, the construct validity of the test is undermined, in the sense that we don’t really know whether the students are actually applying the grammar rule or have simply memorised the sentences through exposure or use in the lessons running up to the test.

Concluding remarks

In this post I have identified and discussed a number of common issues in grammar assessment which undermine the validity and reliability of the data thereby obtained. My criticism wasn’t meant to be an ad hominem attack on Teacher X. After all, she has been trained by NCELP in the use of their instructional and assessment practices and was only applying what she got out of their training.

What is important to take away from the above discussion is that before assessing grammar we must have a clear understanding of what it actually means to KNOW grammar. As a teacher, I need to be clear as to what extent and how I expect my students to know and evidence the learning of the target grammar structure(s) by the end of each lesson, sub-unit, unit, year or cycle. That clarity will inform my assessment practices. Testing whether students have understood how a given structure works (awareness) will require a different assessment task than testing whether that structure has been automatised (fluency).

Another important point is that the claims teachers make about their students’ grammar learning need to be mediated by an understanding of what grammar acquisition involves and of what constitutes VALID testing. We need to be specific as to what we mean by “My students have learnt the French perfect tense”, as grammar knowledge (1) is context- and modality-specific, (2) is constrained by the developmental sequences of acquisition and (3) can be conscious (explicit) or subconscious (implicit).

Finally, tests must be valid and reliable before we can make bold claims about how impressive our students’ progress in grammar learning is, as Teacher X did in her tweet. By making such claims and advertising them to the twitterverse with the keen support of NCELP, one may end up misleading the language-teaching community into adopting assessment practices which, as I have tried to demonstrate above, are actually flawed in many ways.

To find out more about the approach, do get hold of the best-selling book authored by myself and Steve Smith, “Breaking the Sound Barrier: teaching learners how to listen”, or attend my upcoming workshops.
