Learning to recognise embedded words in connected speech
Most current models of spoken word recognition, such as TRACE (McClelland and Elman, 1986), postulate competition mechanisms (implemented by mutual inhibition) to account for the delayed recognition of embedded words reported by Grosjean (1985). In such models, lexical items that share segments inhibit each other, irrespective of whether the items share word boundaries. Information that rules out a longer lexical hypothesis therefore boosts the activation of units representing embedded words (by reducing inhibition), allowing them to be recognised.
Investigations using simple recurrent networks (Norris, 1990; Gaskell and Marslen-Wilson, in press) have shown how cohort effects during the recognition of single words, previously implemented as mutual inhibition between lexical items, can be produced by a recurrent network trained to identify isolated words. By mapping sequences of phonemes to a static representation of the current word, the networks displayed effects of competitor environment as a direct consequence of the probabilistic nature of their training regime, with words becoming fully activated only when they become unique. However, since these networks only activate a representation of the current word in the speech stream, they are unable to use following context to identify embedded words. Networks described by Content and Sternon (1994) extend this approach by adding units that represent the preceding word, thereby allowing the network to implement the delayed recognition described by Grosjean. However, since all of these networks are trained on a corpus in which the target output changes at the boundary between words, they must be provided with an explicitly segmented training set.
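The cohort effect described above can be illustrated with a small calculation. The sketch below is not any of the cited networks. It assumes a hypothetical three-word lexicon, equiprobable words, letters standing in for phonemes, and an idealised network whose output for a word approximates the probability of that word given the input so far.

```python
# Hypothetical toy lexicon; letters stand in for phonemes (an assumption).
LEXICON = ["captain", "captive", "cat"]

def cohort_activation(prefix, word, lexicon=LEXICON):
    """Idealised activation of `word` after hearing `prefix`.

    Assumes equiprobable words, so the activation is simply
    1 / (number of words still consistent with the prefix).
    """
    cohort = [w for w in lexicon if w.startswith(prefix)]
    if word not in cohort:
        return 0.0
    return 1.0 / len(cohort)

# Activation of "captain" as each successive segment arrives.
trajectory = [cohort_activation("captain"[:i], "captain") for i in range(1, 8)]
```

Under these assumptions the activation of "captain" stays below 1.0 until the fifth segment, the point at which "captive" is ruled out and the word becomes unique.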
The work described here investigates whether the approach used by Norris (mapping a temporally changing input to a static output representation) can be used to get a network to learn to recognise words in connected speech, without requiring training on a pre-segmented corpus. We were also interested in whether such an approach would allow recognition of embedded words to be delayed until following context uniquely distinguishes them from longer competitors, without any explicitly implemented mechanism of lexical competition.
A simple recurrent network was trained to map a continuous stream of phonetic features to a simple, localist representation of all the words in the current phrase or utterance. The network is therefore trained on the identities of all the words in a sequence, without being provided with any information about the order in which those words occur, or where boundaries between lexical items may occur in the speech stream.
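One way to picture this training regime is sketched below. It is a minimal illustration rather than the simulation code: the lexicon is hypothetical, and single letters stand in for the phonetic feature vectors that the network actually receives.

```python
# Hypothetical toy lexicon; each word has one localist output unit.
LEXICON = ["cap", "captain", "tucked", "ran"]

def utterance_targets(words, lexicon=LEXICON):
    """Return (segment, target) pairs for one unsegmented utterance.

    The target is the same at every time step: one unit per lexical item,
    switched on if that word occurs anywhere in the utterance. It carries
    no information about word order or word boundaries.
    """
    target = [1 if w in words else 0 for w in lexicon]
    stream = "".join(words)  # concatenated input, no boundary markers
    return [(segment, target) for segment in stream]

pairs = utterance_targets(["cap", "tucked"])
```

Here the utterance "cap tucked" yields nine input segments, each paired with the same target in which only the units for "cap" and "tucked" are active.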
In simulations with a small, artificial corpus we show that the network is able to use regularities between phoneme sequences and units in the output representation to segment the speech stream into lexical items, learning to identify words in the training set. The pattern of activation of these words displays the expected cohort effects, with full activation being delayed until words can be uniquely distinguished from all other lexical items. This therefore constitutes a demonstration that, at least for small training sets, correspondences between the speech stream and lexical representations can be used to segment connected speech into words.
We also demonstrate, by training this network on a corpus containing onset-embedded words, that this approach allows for delayed recognition as observed by Grosjean (1985). For training sets containing pairs of words such as cap and captain, the network learns to use segments following the offset of cap to distinguish embedded words from longer competitors. Even in cases where mismatch between the following context and the longer word comes more than one segment after the offset of the embedded word (equivalent to a case like cap tucked, where the longer competitor captain is only ruled out in the vowel of the second syllable), the network is still able to use following context to revise earlier hypotheses.
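The cap tucked case can be worked through by hand. The sketch below assumes a hypothetical four-utterance corpus, equiprobable utterances, letters in place of segments, and an idealised network whose output for a word approximates the probability that the word occurs in the utterance given the input so far. It illustrates the logic only and does not reproduce the reported simulations.

```python
# Hypothetical toy corpus of two-word utterances (an assumption).
CORPUS = [
    ("cap", "tucked"),
    ("captain", "ran"),
    ("cap", "ran"),
    ("captain", "tucked"),
]

def word_probability(prefix, word, corpus=CORPUS):
    """P(word occurs in the utterance | unsegmented input so far)."""
    consistent = [u for u in corpus if "".join(u).startswith(prefix)]
    if not consistent:
        return 0.0
    return sum(word in u for u in consistent) / len(consistent)

# Idealised activation of "cap" as "captucked" unfolds segment by segment.
stream = "captucked"
cap_trajectory = [word_probability(stream[:i], "cap") for i in range(1, len(stream) + 1)]
```

Under these assumptions "cap" remains only partially activated at its own offset and through the following "t", and reaches full activation at the "u", two segments after its offset, where "captain" is ruled out.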
The recurrent network that we have described learns to delay commitment on the identity of words in the speech stream until the input uniquely identifies a single lexical item. Regardless of whether disambiguation occurs within the word itself (cohort effects) or following a word's offset (as is the case for embedded words), the network provides the optimal solution to the conflicting constraints of speed and reliability. This pattern of early activation and delayed recognition has previously been assumed to reflect the operation of inhibitory competition between lexical-level representations in models such as TRACE. By making the target of the recognition process a level representing information larger than a single word, the network we have described here produces this behaviour as a direct consequence of the probabilistic nature of the network’s training regime.
Having developed a network capable of learning to identify lexical units from an unsegmented training corpus, we then used the model as a test bed for investigating the effects of different sources of information and secondary mappings on the learning of lexical segmentation. We further demonstrate the role of auto-encoding mappings (i.e. predicting the next segment) in bootstrapping the segmentation of lexical items (cf. Cairns et al., 1994) and also report investigations of the network's processing of supra-segmental cues to word length (such as syllable duration; Lehiste, 1972), drawing comparisons with recent experimental work on the recognition of embedded words in connected speech (Davis et al., 1997).
Cairns, P., Shillcock, R., Chater, N., & Levy, J. (1994). Lexical segmentation: the role of sequential statistics in supervised and unsupervised models. In Ram & Eiselt (Eds.), Proceedings of the sixteenth annual conference of the cognitive science society. Hillsdale, NJ: Lawrence Erlbaum Associates.
Content, A., & Sternon, P. (1994). Modeling retroactive context effects in spoken word recognition with a simple recurrent network. In Ram & Eiselt (Eds.), Proceedings of the sixteenth annual conference of the cognitive science society. Hillsdale, NJ: Lawrence Erlbaum Associates.
Davis, M., Gaskell, M. G., & Marslen-Wilson, W. D. (submitted). Ambiguity and competition in lexical segmentation. Paper submitted to the 19th conference of the cognitive science society.
Gaskell, M. G., & Marslen-Wilson, W. D. (in press). Integrating form and meaning: a distributed model of speech perception. Language and Cognitive Processes.
Grosjean, F. (1985). The recognition of words after their acoustic offset: Evidence and implications. Perception and Psychophysics, 38(4), 299-310.
Lehiste, I. (1972). The timing of utterances and linguistic boundaries. Journal of the Acoustical Society of America, 51(6), 2018-2024.
Luce, P. A. (1986). A computational analysis of uniqueness points in auditory word recognition. Perception and Psychophysics, 39, 155-158.
McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1-86.
Norris, D. (1990). A dynamic-net model of human speech recognition. In G. T. M. Altmann (Ed.), Cognitive Models of Speech Processing. Cambridge, MA: MIT Press.