How the speaking process unfolds in the brain and the FIVE PILLARS of speaking instruction

Introduction


Alongside listening, speaking is the skill that scares learners the most, and rightly so. Unlike writing, where you can take your time and carefully polish your words, speaking happens on the fly. There’s no backspace key. No time to hesitate. Everything has to come together in real time: the message you want to convey, the grammar to wrap it in, the vocabulary to fill it out, and the sounds to articulate it. And if you’re learning a second language, the load becomes even heavier. Each sub-process (planning, retrieving, encoding, monitoring) takes up your mental energy, like separate strings pulling in different directions.


In this article, I focus primarily on teaching speaking to lower-intermediate learners, corresponding roughly to the B1 level of the CEFR. At this level, learners can communicate in everyday situations, handle short social exchanges, and describe experiences or events in simple terms, but often struggle with fluidity, accuracy, and vocabulary depth. In the UK, many GCSE students—particularly in their final year—fall somewhere between A2 and low B1, depending on exposure, instruction quality, and individual aptitude. While GCSE specifications may claim alignment with B1 outcomes, most learners operate with far more limited productive fluency, especially in spontaneous speech.


For learners operating at lower-intermediate or intermediate level, this real-time load makes speaking a cognitively exhausting endeavour. Planning what to say in a foreign language under time pressure, while also keeping track of how you’re being understood, is no easy feat at a level of proficiency where vocabulary is limited and grammar and pronunciation are far from being proceduralised. The learner has to juggle several demanding cognitive operations at once.


One of the most influential frameworks for understanding how speaking unfolds is Levelt’s (1989) model of speech production. Originally designed to describe L1 speaking processes, the model has been widely adopted and adapted within L2 acquisition research. It outlines four key stages: conceptualisation, formulation, articulation, and monitoring. In L2 contexts, scholars such as Kormos (2006) have extended the model to include the impact of limited attentional resources, slower lexical retrieval, and interference from the learner’s first language. These modifications are crucial, as they highlight that L2 speech is not simply “slower L1 speech,” but involves qualitatively different challenges, particularly in the coordination of sub-processes under pressure.

This article outlines the cognitive sub-processes involved in the act of speaking, referring to Levelt’s original model and L2-specific extensions. Each stage will be examined in detail, with a focus on how long it takes to unfold, its cognitive demands, and how it affects L2 speakers. Finally, we explore how a process-oriented approach to listening and speaking instruction, beginning with lexical chunks and culminating in fluency training, can mitigate these challenges.

How the speaking process unfolds in the brain

The flow chart below visually represents the key sub-processes involved in speaking, starting with conceptualisation, where the speaker decides what to say. It then moves to formulation, where vocabulary and sentence structure are selected. Phonological encoding follows, as the speaker prepares the sounds for articulation, including aspects like stress and intonation. The articulation step represents the physical production of speech. Finally, monitoring occurs, where the speaker checks for errors and makes corrections if needed. Each of these stages presents unique challenges for second language learners, such as difficulties in vocabulary recall, grammar application, pronunciation, and maintaining speech flow while self-monitoring.

Let’s zoom in

Conceptualisation: From Intention to Preverbal Message


According to Levelt’s (1989) influential speech production model, the first stage in speaking is conceptualisation—the process of formulating an intention and generating a ‘preverbal message’. This is where the speaker decides what they want to say based on their communicative goals, the context, and their interlocutor.


This stage is largely non-linguistic and draws heavily on working memory and attention (Baddeley, 2000). In L2 users, this phase is often slowed by limited automaticity in accessing ideas or by difficulties in filtering what is relevant for the context. Lower-intermediate learners in particular struggle with task schemata—knowing what content is expected in specific interactions (Bygate, 2001). The typical time window for this initial conceptual preparation is 200–400 milliseconds (Indefrey & Levelt, 2004).

Formulation: Lexical Selection and Grammatical Encoding


Take a learner who wants to describe their weekend in French and attempts to say something like “J’ai regardé un film avec mes amis.” To do this, they must retrieve verbs in the passé composé, choose the correct auxiliary, recall agreement rules, and access the noun and modifiers. For lower-intermediate learners, this is where it often falls apart. They might know the verb “regarder” but hesitate on the auxiliary: avoir or être? They might reach for “copains” instead of “amis,” or get stuck trying to recall the correct article.


Once the preverbal message is ready, the speaker moves into formulation, where the message is encoded linguistically. This involves:


• Lexical selection: choosing appropriate content and function words


• Grammatical encoding: applying morphological and syntactic rules to create well-formed utterances.


This phase is cognitively taxing, particularly for L2 learners. Vocabulary retrieval in an L2 is significantly slower (Segalowitz, 2010), and grammatical encoding is often interrupted by underdeveloped procedural knowledge (DeKeyser, 2007). Moreover, lexical access in L2 speakers is more susceptible to interference from the L1, which can cause lexical or structural errors.


The time estimates for lexical selection are roughly 150–250 milliseconds per word, depending on familiarity and fluency (Levelt et al., 1999; Indefrey & Levelt, 2004). Sentence-level formulation can take longer, especially for learners whose knowledge is less automatised. In this very narrow time window the language learner needs to retrieve the correct vocabulary, apply any morphological rules and then sequence the words in the correct order. Unsurprisingly, many students who have not been taught vocabulary and grammar orally fail at this stage in the process. Imagine Year 8 or 9 students having to retrieve the words required to describe what they and their friends did last weekend whilst simultaneously having to apply the rules of the perfect tense of verbs requiring the auxiliary être, at around 250 milliseconds per word! No wonder they usually answer using prefabricated chunks!
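
To put rough numbers on this, here is a back-of-the-envelope sketch in Python. It assumes, purely for illustration, that lexical selection happens strictly one word at a time at the 150–250 ms per-word estimates cited above, and it ignores grammatical encoding, articulation and any overlap between stages. The function name and the choice of sentence are hypothetical, and the result is a crude estimate, not a measurement.

```python
# Illustrative only: sums the cited per-word lexical selection estimates
# (150-250 ms per word) over a short sentence. Strictly serial retrieval is
# an assumption made for this sketch, not a claim about real speech.

FAST_MS_PER_WORD = 150  # lower estimate cited in the text
SLOW_MS_PER_WORD = 250  # upper estimate cited in the text


def lexical_selection_budget(sentence: str) -> tuple[int, int]:
    """Return (fastest, slowest) total lexical selection time in milliseconds."""
    word_count = len(sentence.split())  # whitespace-separated words
    return word_count * FAST_MS_PER_WORD, word_count * SLOW_MS_PER_WORD


sentence = "J'ai regardé un film avec mes amis"
fastest, slowest = lexical_selection_budget(sentence)
print(f"{len(sentence.split())} words: {fastest}-{slowest} ms")
```

On these assumptions, even this seven-word sentence calls for somewhere between about one and two seconds of lexical retrieval at the cited rates, before any auxiliary choice or agreement rule has been applied.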

Phonological Encoding and Articulation


Assuming the formulation phase is successful, the learner must now articulate: “J’ai regardé un film avec mes amis.” But here too, things get tricky. Mispronunciation of “regardé,” deaccentuation of “mes amis,” or poor rhythm can impair intelligibility. A frequent issue is the liaison in “mes amis”—if not made, the learner’s speech sounds choppy or unclear. Or the learner might struggle with the uvular [ʁ] in “regardé,” substituting a harder English-like ‘r’ that interferes with intelligibility. These phonological glitches are common even when vocabulary and syntax are intact.


In this stage, the speaker organises the phonological form of the utterance. This includes retrieving the correct pronunciation, applying prosodic features (intonation, rhythm), and preparing motor plans for articulation.


Fluent L1 speakers can initiate articulation within 600–750 milliseconds of conceptualisation (Meyer, 2000), but L2 learners may hesitate, pause, or mispronounce words due to weak phonological encoding. This is especially evident in learners with low exposure to authentic spoken input or limited phonological memory (Service, 1992).


Lower-intermediate learners often struggle with:


• Phoneme discrimination and recall
• Prosody (especially in stress-timed languages like English)
• Applying correct intonation in real-time


These issues compound when learners are under pressure to speak fluently, increasing their cognitive load and sometimes causing breakdowns in communication.

Monitoring: Self-Regulation and Repair


The final sub-process is monitoring, where the speaker evaluates their output for accuracy and appropriateness. Levelt (1989) conceptualised this as an internal speech comprehension loop: the speaker hears their own output and compares it with their intention.


Even if the learner says “J’ai regardé un film avec mes amis,” they might instantly second-guess themselves. Was it regardé or regardais? Should they have said copains instead of amis? This internal checking process can lead to unnecessary corrections or hesitations: “J’ai… euh… j’ai regardé… non, j’ai vu un film…” These repairs slow down speech and can reduce fluency, especially if the learner is preoccupied with form over communication. Encouraging learners to tolerate minor slips and correct after speaking can reduce this form-focused overload.


In L2 learners, the monitoring system is often overloaded. Lower-intermediate speakers may lack the fluency to detect errors in real-time, or they may be too focused on accuracy, causing frequent self-repairs, hesitations, and a loss of fluency (Kormos, 2006). The balance between fluency and accuracy in self-monitoring is often skewed towards caution, leading to reduced confidence and processing speed.

Cognitive Bottlenecks for L2 Speakers


For lower-intermediate to intermediate learners, the real-time nature of speaking creates several cognitive bottlenecks:
• Slow lexical retrieval: due to lack of automaticity and limited exposure
• Grammatical processing overload: conscious rule-application slows down encoding
• Phonological instability: weak sound representations affect fluency and intelligibility
• Overactive monitoring: learners focus too much on error-avoidance rather than message delivery
As a result, these learners often rely on formulaic expressions, pauses, fillers, and simplified syntax to manage cognitive load.

Implications for a Process-Based Approach: the FIVE PILLARS of speaking instruction


Understanding the cognitive complexity of speaking has major implications for classroom instruction—particularly if we truly want to go beyond ‘speaking practice’ and actually develop real-time speech competence. Rather than treating speaking as a single monolithic skill, we need to see it as a layered process. Each sub-skill—conceptualising, retrieving lexis, applying grammar, encoding phonology, articulating and monitoring—must be nurtured in its own right, and gradually automatized through carefully scaffolded instruction.


One key implication is this: if we’re serious about helping our learners speak fluently, we must abandon the traditional ‘accuracy-first’ model that floods learners with grammar rules, then expects them to string words together on the fly. It simply doesn’t work—not in real-time conditions where cognitive load is already through the roof. Instead, learners need repeated, structured exposure to lexis and grammar in context, followed by masses of retrieval and recycling across the modes. This includes input processing, controlled output, guided fluency training and carefully spaced retrieval. Each phase of this cycle must map clearly onto a specific stage of the speech production model. And, most importantly, it must feel safe and doable for the learner.


In my own practice, I’ve found that modelling language through high-frequency lexical chunks, sentence builders and communicative routines creates a reliable scaffold. When learners can plug content into predictable structures, they’re free to focus their cognitive energy on message construction and pronunciation. That’s when real fluency starts to emerge—not when they’re mentally conjugating verbs while trying to hold a conversation.


Monitoring, too, deserves special attention. Learners at this level often monitor too much—pausing, correcting, second-guessing. We need to re-train them to delay monitoring until after their message is out. Recording, re-listening, summarising, peer editing—all of these build confidence and reduce the urge to self-correct mid-sentence.


Finally, let’s not forget the bigger picture: speaking proficiency is deeply rooted in listening. You can’t produce what you haven’t processed. That’s why I always recommend beginning with listening-as-modelling—intensive, scaffolded, chunk-based listening input that feeds into structured oral output. Only when input is rich, patterned, and digestible can output become fluent.


In what follows, I outline five pillars of process-based instruction that address the major bottlenecks identified above. Each pillar targets a specific sub-process through focused practice and gradual automatisation.

1. Begin with Lexical Chunks


Let’s start with the basics. If we want to reduce the cognitive load associated with formulation, we must give learners language they can draw on quickly and easily. This is where teaching lexical chunks—pre-assembled word sequences—makes all the difference. Following Wray (2002) and Nation (2013), instruction should start with frequent, high-utility lexical chunks that serve communicative functions. These bypass the need to assemble utterances from scratch and give learners the scaffolding they need to speak more fluently from the start.


In my own approach, sentence builders, retrieval practice and oral fluency routines built around these chunks are core, because they reduce planning time. When learners can retrieve and manipulate these ready-made building blocks, they’re no longer paralysed by the need to “find the right word” or mentally conjugate verbs mid-sentence. The end result? Increased fluency, confidence, and willingness to engage.

2. Support Grammar Proceduralisation


Grammar doesn’t just need to be learned—it needs to be automatised. Far too often, learners are expected to remember isolated rules and apply them in real time, under pressure. Unsurprisingly, they struggle. What we need instead is a gradual shift from declarative to procedural knowledge—what DeKeyser (2007) and Ellis (2002) have long advocated.


In practical terms, this means designing tasks where learners are repeatedly exposed to key structures in varied, meaningful contexts. One set of activities in this process is repetitive oral drills, or “chunking aloud,” where students repeatedly practise grammatical structures in varied contexts to reinforce their automatic recall. This may be followed by oral retrieval practice tasks where students test one another on the target chunks of language (e.g. Oral Ping-Pong, Battleship, Snakes and Ladders or No Snakes No Ladders). This is complemented by controlled speaking practice, where learners engage in structured dialogues or speaking tasks that focus on a specific grammar point, providing the opportunity to use the form in context. Additionally, sentence expansion and transformation exercises encourage learners to manipulate sentences by changing components or structures, which helps them internalise grammar rules through active use. Communicative activities, such as information gaps and role-plays, further promote the use of grammar in real-life contexts, enhancing both fluency and accuracy. Feedback, both immediate and delayed, plays a key role in identifying errors and reinforcing correct grammatical usage, ensuring that learners are able to reflect on their mistakes and adjust their use of grammar in future speaking tasks. These activities work together to support grammar proceduralisation, allowing learners to move from conscious rule application to the automatic use of grammatical structures in spontaneous communication.

In the above tasks the target grammar structures are made ‘task essential’, i.e. necessary for the completion of the task. For instance, you may design a Mind-reading game and a Sentence Stealer game, followed by an ‘Oral Ping-Pong’ and a ‘No Snakes No Ladders’ task, and then by a short dialogue (with L1 prompts) to be translated orally, where the French verb faire in the present tense features in every single sentence. This may be followed by a Spot the Difference task where Partners 1 and 2 have to describe their respective pictures, still using faire in the present. Finally, you could stage a game of Faster recycling the same verb. By the end of this sequence, you would hope to have reached a degree of proceduralisation of the target verb in the present, wouldn’t you?

In essence, grammar instruction should aim at proceduralisation—not just rule explanation. This can be achieved through repeated use in familiar contexts (Ellis, 2002), pattern drills, and structured input tasks where grammar is embedded in meaningful communication.

3. Enhance Phonological Awareness


Phonological encoding is the silent saboteur of L2 fluency. Learners might know what they want to say—but if they can’t retrieve the sounds or stress patterns of the words, their message stalls. This is especially true for learners from syllable-timed L1 backgrounds trying to speak stress-timed languages like English.


So what’s the fix? Learners need systematic training in phonological decoding and encoding. Activities like minimal pair discrimination, prosody shadowing, and rhythm tapping are not ‘nice extras’; they are essential. Listening-as-modelling, one of the key pillars of my instructional framework, plays a central role here. By repeatedly hearing and mimicking well-modelled input, learners internalise the rhythm and stress patterns that underpin fluent delivery. Chunking aloud and other reading-aloud techniques also play a key role.

4. Train Strategic Monitoring


Learners often monitor their speech too much—and too early. The result? Frequent pauses, self-corrections, and disrupted communication. What they need is training in strategic monitoring: learning when and how to correct themselves in a way that supports fluency rather than undermines it.


One way to do this is by using recording and playback tasks, where learners speak first and evaluate later. Another is to apply fluency-then-accuracy sequences, where learners produce language freely before revisiting their output for improvement. As Kormos (2006) suggests, shifting monitoring to a post-production phase can free up working memory and reduce performance anxiety. Of course, not every learner will need this level of scaffolding, but it can be transformative for those at risk of fossilising or losing confidence.

Do we have the time to do the above with every student and class? Perhaps not; perhaps only with exam classes. But whatever time you can invest in these activities is well spent.

5. Prioritise Fluency Training


Fluency doesn’t just happen. It must be explicitly taught, nurtured, and rehearsed systematically. And no, fluency isn’t just about speaking fast—it’s about the seamless coordination of all sub-processes under time pressure. This is what makes it so cognitively demanding and what makes explicit fluency training such a pedagogical priority.


Drawing on the work of Nation (1989, 2013), we must treat fluency as a skill in its own right, with structured and repeated opportunities for learners to speak under progressively less scaffolded, more time-sensitive conditions. Time-limited speaking tasks, repeated performance activities (such as the 4-3-2 technique), and familiar-task recycling allow learners to gradually speed up the retrieval and formulation process.


In my own framework, fluency is the final stage of a cycle that begins with highly scaffolded input (listening as modelling), builds through structured output (with sentence builders and oral frames), and culminates in ‘pushed output’. Here, learners are encouraged to retrieve and manipulate language chunks quickly and spontaneously in a controlled environment. Activities such as ‘Messengers’, the ‘4-3-2 technique’, ‘Market place’, ‘Faster’ and ‘Fast and Furious’ are great ways to work on oral fluency. As learners’ confidence grows, we gradually increase the demands, not just on speed but also on accuracy and complexity.


When learners know the lexis, the grammar is proceduralised, the pronunciation is modelled, and the task is clear, they can focus on flow. That is the true goal of fluency training—not speed for its own sake, but smooth, confident, and intelligible communication.

Conclusion


Speaking is not a single act but a series of fast-paced, overlapping cognitive operations. Each sub-process—conceptualisation, formulation, phonological encoding, articulation, and monitoring—presents unique challenges for L2 learners, particularly at the lower-intermediate and intermediate levels. By recognising these challenges and targeting instruction accordingly, we can build learners’ capacity to speak fluently, accurately, and confidently.


Whether you embrace EPI or not, a process-based approach, beginning with chunks and leading toward fluent, spontaneous production, provides a roadmap for overcoming cognitive bottlenecks and enabling real communicative competence.