Recognising continuous speech
A central theme running through much of our research on spoken word recognition has been the problem of how we recognise continuous speech. In contrast to written language, where there are white spaces between the words, spoken language contains few reliable cues to the location of word boundaries. How do listeners solve the problem of recognising spoken words without first knowing where the words actually are in the input? How do listeners avoid thinking that they hear 'cat', 'a', and 'log' when someone says the word 'catalogue'? We have been tackling these and related problems in spoken word recognition through a mixture of conventional experimental work, on English and on other languages, and the development of a large-scale connectionist model of spoken word recognition called Shortlist (Norris, 1994).
When we listen to someone speaking we get the impression of hearing a discrete series of words. However, computational models of human speech recognition such as Shortlist and TRACE (McClelland & Elman, 1986) assume that listeners unconsciously have to consider 'cat', 'a', and 'log' - and maybe 'cattle' - when they hear 'catalogue'. They then have to discover which of these alternatives forms the most likely interpretation of what they hear. This process of discovering the best interpretation is often described as lexical competition: different candidate words compete with each other to determine the optimum interpretation.
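The segmentation problem described above can be illustrated with a minimal sketch. It simply enumerates every way of dividing an input into known words; it is not the competition mechanism used in Shortlist or TRACE, which selects among candidates through activation and inhibition. The function name `segmentations`, the toy lexicon, and the rough sound spellings (such as "katalog" for 'catalogue') are assumptions made for illustration only.

```python
def segmentations(sounds, lexicon):
    """Return every way to split `sounds` into a sequence of lexicon entries.

    `sounds` is a string of (hypothetical) sound symbols; `lexicon` maps a
    sound string to the word it spells. Exhaustive search, for illustration.
    """
    if not sounds:
        return [[]]  # one way to parse nothing: the empty sequence
    parses = []
    for end in range(1, len(sounds) + 1):
        candidate = sounds[:end]
        if candidate in lexicon:
            # Every parse of the remainder extends this candidate word.
            for rest in segmentations(sounds[end:], lexicon):
                parses.append([lexicon[candidate]] + rest)
    return parses


# Toy lexicon with invented sound spellings (an assumption, not real phonetics).
TOY_LEXICON = {"kat": "cat", "a": "a", "log": "log", "katalog": "catalogue"}

parses = segmentations("katalog", TOY_LEXICON)
for parse in parses:
    print(" + ".join(parse))
```

With this toy lexicon the input has two complete parses, 'cat + a + log' and 'catalogue', and a model of lexical competition has to explain why listeners settle on the second.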
The Shortlist model (Norris, 1994; Norris & McQueen, 2008)
Shortlist is a model of how people recognise words in continuous speech - that is, normal conversational speech where there are no gaps or pauses between the words. Shortlist started as a connectionist model using an interactive activation network (Norris, 1994). Shortlist served several purposes. First, it was the first computational model able to run simulations using a realistic vocabulary. This meant that we could simulate data from experiments by feeding the model exactly the same words as used in the experiments, rather than simply using a few 'toy' illustrative examples. Second, it was able to demonstrate the viability of a model of speech recognition with a completely bottom-up flow of information. In contrast to other psychological models, there was no feedback from the lexicon to pre-lexical stages of speech processing. Over the years the model has been extended to provide simulations of a wider range of data on the way people segment speech into individual words (e.g. the Possible Word Constraint of Norris, McQueen, Cutler and Butterfield, 1997).
Shortlist B (Norris & McQueen, 2008)
In a more recent development, Norris & McQueen (2008) have produced a new version of the model (Shortlist B - B for Bayesian) which shows how it can be derived from the simple principle that people behave as though they are deriving an optimal interpretation of the speech input using all the available information. In other words, people behave as though they recognise speech by a process of Bayesian inference. This move results in a far simpler model because many of the features and parameters in the original model now follow inevitably from the principle of optimality. For further details see Dennis Norris's research pages.
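The idea of Bayesian inference over word candidates can be sketched in a few lines. This is a generic application of Bayes' rule, not the Shortlist B implementation: the function name `posterior`, the candidate words, and all of the probabilities are invented for illustration. The posterior probability of each word is its prior probability multiplied by the likelihood of the input given that word, normalised across the candidates.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' rule over a closed set of candidate words.

    priors[w]      -- P(w), e.g. derived from word frequency (assumed values)
    likelihoods[w] -- P(input | w), how well the input matches w (assumed values)
    Returns P(w | input) for every candidate.
    """
    unnormalised = {w: priors[w] * likelihoods[w] for w in priors}
    total = sum(unnormalised.values())
    return {w: p / total for w, p in unnormalised.items()}


# Invented numbers: 'cap' is assumed more frequent, but the input is assumed
# to match 'captain' better.
priors = {"cap": 0.6, "captain": 0.4}
likelihoods = {"cap": 0.2, "captain": 0.6}

beliefs = posterior(priors, likelihoods)
for word, probability in beliefs.items():
    print(f"{word}: {probability:.3f}")
```

In this toy case the better-matching candidate ends up with the higher posterior probability despite its lower prior, which is the sense in which prior knowledge and perceptual evidence are combined.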
Phonological and phonetic cues to speech segmentation
The Possible-Word Constraint
The Shortlist model has been extended to explain how listeners can use various phonological cues to help them identify word boundaries. For example, listeners can make use of the fact that most content words in English begin with a strong syllable (the first syllable of "cabbage" is strong, the first syllable of "cigar" is weak). The implicit use of such prosodic information by listeners has been referred to as the Metrical Segmentation Strategy (Cutler and Norris, 1988). Listeners can also make use of the knowledge that word boundaries shouldn't be placed so that they leave 'residues' without a vowel. So, for example, when listeners hear a nonsense word like "seesh" they find it much harder to realise that it contains the word "see" than when they hear "seeshub". To break "seesh" into "see" and "sh" leaves a residue ("sh") that couldn't possibly be a real English word, because it doesn't contain a vowel. On the other hand, "shub" does contain a vowel, and might possibly be (or become) an English word. This is known as the Possible Word Constraint (Norris, McQueen, Cutler and Butterfield, 1997).
Further work has shown that the PWC is driven solely by whether or not the 'residue' contains a vowel, and that the identity of the vowel is unimportant. Norris, Cutler, McQueen, Butterfield, and Kearns (2001) showed that the PWC operates even when the vowel in the residue is a schwa or a lax vowel. Syllables with schwa or lax vowels cannot be real words in English. This suggests that the PWC is a language-universal strategy, and is guided by consideration of what might be a word in any language, not by whether the residue could be a word in the listener's own language. A strategy like the PWC should be of great value in acquiring language as well as in perceiving it, as it would provide the infant with valuable information about the likely location of word boundaries. In accord with this, Johnson, Jusczyk, Cutler and Norris (2003) have found evidence that 12-month-old babies behave as the PWC predicts.
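The vowel test at the heart of the PWC can be sketched as a simple check. This is a deliberately simplified illustration and not the Shortlist implementation: it treats the constraint as an all-or-nothing test, uses spelling as a stand-in for sounds, and takes the edges of the input as the only known boundaries. The names `residue_is_possible_word` and `pwc_allows` are hypothetical.

```python
# Letters standing in for vowel sounds (an orthographic simplification).
VOWEL_LETTERS = set("aeiou")


def residue_is_possible_word(residue):
    """A residue is acceptable if it is empty or contains at least one vowel."""
    return residue == "" or any(ch in VOWEL_LETTERS for ch in residue)


def pwc_allows(utterance, word):
    """Check whether spotting `word` in `utterance` leaves only viable residues.

    Only the first occurrence of `word` is considered. Returns False if the
    word is absent or if the material before or after it lacks a vowel.
    """
    start = utterance.find(word)
    if start == -1:
        return False
    before = utterance[:start]
    after = utterance[start + len(word):]
    return residue_is_possible_word(before) and residue_is_possible_word(after)


print(pwc_allows("seesh", "see"))    # residue "sh" has no vowel
print(pwc_allows("seeshub", "see"))  # residue "shub" contains a vowel
```

The first call rejects "see" in "seesh" and the second accepts it in "seeshub", mirroring the pattern of difficulty described above.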
Prosodic cues to lexical segmentation
One potential source of information that listeners may be able to use in word segmentation comes from durational differences between the syllables in monosyllabic and bisyllabic words. For example, Lehiste (1972) reports significant shortening of the syllable [slip] in words like sleepy and sleepiness. In a series of experiments Davis, Marslen-Wilson, & Gaskell (2002) have shown that listeners can use this information to differentiate between embedded words and longer competitors in spoken sentences. Stimulus sentences for these experiments contained monosyllabic words or frequency-matched bisyllables that contained the monosyllable as the initial syllable (such as the pair cap and captain), placed in a spoken sentence frame ("The soldier saluted the flag with his cap tucked/with his captain looking"). An initial gating study showed that duration differences (and possibly other prosodic cues) were able to bias, as appropriate, towards either monosyllabic or bisyllabic interpretations before the end of the syllable. A subsequent series of cross-modal repetition-priming experiments examined in detail the activation of onset-embedded words and longer competitors as the critical syllable and its following context was heard. These showed immediate effects on the relative level of activation of lexical hypotheses according to their compatibility with bottom-up prosodic constraints. This is a significant result which not only underlines the role of fine-grained acoustic detail in the perception and segmentation of connected speech but also suggests that onset-embedding may be a less serious computational problem than previous theorists have suggested.
Phonological variation in lexical and sentential context
Regular phonological variation, for example, assimilation of place of articulation, is an important aspect of fluent speech production and recognition, and can be used on-line by the listener as a cue to segmentation (Gaskell & Marslen-Wilson, 1996, 1998). We extended this research to look at the case where phonological variation could generate a natural case of lexical ambiguity involving two normally unambiguous words. For example, in running speech, "a quick run picks you up" can become confusable with "a quick rum picks you up". This is because the coronal place of articulation of the final consonant of "run" assimilates to the labial place of articulation of the initial consonant of "picks", yielding a surface form which is closer to "rum" than to "run". We examined the perception of this kind of change in Gaskell & Marslen-Wilson (2001a; see also Gaskell & Marslen-Wilson, 1999b; Gaskell, 2000a, 2000b). The results supported previous research in demonstrating the importance of evaluating phonological alternations in their phonological context, but also demonstrated a strong effect of sentential context on resolution of this kind of ambiguity. This provides a bridging link between models of spoken word recognition and lexical ambiguity resolution.
A second strand of research in this area examined the effects of resyllabification and phonological liaison in the perception of French. The dominant model of speech perception for French emphasises the role of the syllable as a segmentation and access unit. When vowel-initial words are embedded in fluent speech, the syllable model predicts that natural resyllabification processes such as liaison should impede recognition of these words. Our research demonstrated the opposite effect (Gaskell, Spinelli & Meunier, in press; Spinelli, Gaskell & Meunier, 2000), and showed that, as in English, sensitivity to the phonological context of regular variations in speech is a crucial component of the recognition system. The results suggest that the role of the syllable in models of speech processing for Romance languages should be refined.

