
An Introduction to Noise-Vocoded Speech
Matt Davis

MRC Cognition and Brain Sciences Unit
Chaucer Road
Cambridge CB2 2EF.

In recent work we have investigated the processes by which listeners learn to understand speech that has been noise-vocoded. This is a form of distortion that was developed by Bob Shannon (Shannon et al., 1995) to simulate the experience of hearing speech transduced by a cochlear implant.

I created this www page so that people can hear examples of noise-vocoded sentences, and experience at first hand some of the phenomena that are described in our recent paper:

Davis, M. H., Hervais-Adelman, A., Taylor, K., McGettigan, C., & Johnsrude, I. S. (2005) "Lexical information drives perceptual learning of distorted speech: evidence from the comprehension of noise-vocoded sentences" Journal of Experimental Psychology: General, 134(2), 222-241.

1) What is noise-vocoded speech?

Noise-vocoded speech is created in four steps: (1) the speech signal is divided into logarithmically-spaced frequency bands; (2) the amplitude envelope is extracted in each frequency band; (3) each envelope is used to modulate noise in the same frequency band; and finally (4) the bands of modulated noise are recombined to create the noise-vocoded sentence.
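The four steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Praat script used for the examples on this page: the second-order band-pass filter, the half-wave rectifier followed by a 30 Hz one-pole low-pass for envelope extraction, and all function names (bandpass, envelope, noise_vocode) are choices made for this sketch. The band boundaries are the ones listed in the table in this section, and a modulated tone stands in for a recorded sentence.

```python
import math
import random

def bandpass(signal, lo, hi, fs):
    """Second-order band-pass (standard biquad design) between lo and hi Hz."""
    f0 = math.sqrt(lo * hi)                  # geometric centre of the band
    q = f0 / (hi - lo)
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha
    b0, b2 = alpha / a0, -alpha / a0         # the b1 coefficient is zero
    a1, a2 = -2 * math.cos(w0) / a0, (1 - alpha) / a0
    out, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for x in signal:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1, y2, y1 = x1, x, y1, y
    return out

def envelope(signal, fs, cutoff=30.0):
    """Amplitude envelope: half-wave rectify, then one-pole low-pass.
    The 30 Hz cutoff is an assumption of this sketch."""
    k = 1 - math.exp(-2 * math.pi * cutoff / fs)
    out, y = [], 0.0
    for x in signal:
        y += k * (max(x, 0.0) - y)
        out.append(y)
    return out

def noise_vocode(speech, fs, edges, seed=0):
    """Noise-vocode a list of samples using the given band edges (Hz)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in speech]
    out = [0.0] * len(speech)
    for lo, hi in zip(edges[:-1], edges[1:]):
        env = envelope(bandpass(speech, lo, hi, fs), fs)  # steps 1 and 2
        carrier = bandpass(noise, lo, hi, fs)             # noise in the same band
        for i, (e, c) in enumerate(zip(env, carrier)):
            out[i] += e * c                               # steps 3 and 4
    return out

# Stand-in for speech: 0.1 s of a 440 Hz tone with a slow amplitude modulation.
fs = 22050
edges = [50, 229, 558, 1161, 2265, 4290, 8000]
tone = [math.sin(2 * math.pi * 440 * n / fs)
        * (0.5 + 0.5 * math.sin(2 * math.pi * 4 * n / fs))
        for n in range(fs // 10)]
vocoded = noise_vocode(tone, fs, edges)
```

A practical implementation would use steeper filters than the second-order sections shown here, so that neighbouring bands overlap less. The sampling rate must be more than twice the highest band edge.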

The figure below illustrates the processing stages involved in generating a 6-band noise-vocoded version of an existing spoken sentence. If you click on parts of the figure, you can hear the speech signal at different stages.

The example vocoded sentence shown in the figure was created by dividing the speech signal into six logarithmically-spaced frequency bands (cf. Greenwood, 1990). The boundary and centre frequencies of each band are shown in the table below, along with icons to click on to play sentences after filtering into each frequency band. You may notice that speech is quite intelligible even when filtered into a fairly narrow frequency range. However, this depends on access to detailed spectral information - once noise-vocoded (i.e. once amplitude information is used to modulate a noise source) the sounds in each band are entirely unintelligible. Only when the bands are recombined can speech be perceived.

 
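Frequency (Hz)

Band               Min     Centre   Max     Clear speech   Noise-vocoded speech
1                  50      140      229     [audio]        [audio]
2                  229     394      558     [audio]        [audio]
3                  558     860      1161    [audio]        [audio]
4                  1161    1713     2265    [audio]        [audio]
5                  2265    3278     4290    [audio]        [audio]
6                  4290    6145     8000    [audio]        [audio]
6 bands combined                            [audio]        [audio]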
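The band boundaries in the table can be reproduced by spacing the edges equally along the cochlea using Greenwood's (1990) frequency-position function, F = A(10^(ax) - k). The sketch below assumes A = 165.4 Hz and k = 1; these constants are not stated on this page and were chosen because they reproduce the table (the commonly quoted human fit uses k = 0.88). The slope constant a cancels out when the edges are equally spaced between two fixed frequencies. The function name greenwood_edges is hypothetical. The centre values in the table are consistent with the arithmetic midpoint of the rounded boundaries, which is an observation about the numbers rather than a statement of how they were originally computed.

```python
import math

A = 165.4   # Hz; scaling constant for the human cochlea (assumed here)
K = 1.0     # assumed; chosen because it reproduces the table on this page

def greenwood_edges(n_bands, f_min=50.0, f_max=8000.0):
    """Band edges equally spaced in cochlear position between f_min and f_max."""
    lo = math.log10(f_min / A + K)
    hi = math.log10(f_max / A + K)
    step = (hi - lo) / n_bands
    return [A * (10 ** (lo + i * step) - K) for i in range(n_bands + 1)]

edges = [round(f) for f in greenwood_edges(6)]
# Midpoint of each pair of rounded edges, with halves rounded up.
centres = [math.floor((lo + hi) / 2 + 0.5)
           for lo, hi in zip(edges[:-1], edges[1:])]

print(edges)    # [50, 229, 558, 1161, 2265, 4290, 8000]
print(centres)  # [140, 394, 860, 1713, 3278, 6145]
```

Calling greenwood_edges with a different number of bands gives boundaries over the same 50 to 8000 Hz range, under the same assumptions.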

2) Varying the number of bands in vocoded speech:

One important variable that determines how much can be understood from a vocoded sentence is the number of frequency bands used in synthesising the distorted speech. As should be apparent from listening to the sounds embedded in the table below, sentences synthesised with fewer than 4 bands are extremely difficult to understand. Sentences synthesised with 10 or more bands can be readily understood with very little practice and without knowing the content of the speech.

For some of the example sentences try listening to a 1 band version of the sentence, then gradually increase the number of bands in the sentence. Try to decide how many bands you would need to clearly understand the sentence. For other sentences, start with a 15 band version, then reduce the number of bands until you stop being able to understand the sentence. You might notice a difference between the number of bands that you need to start or stop understanding speech. This is an example of 'pop-out': noise-vocoded speech sounds much clearer when you know the identity of the sentence.

Perhaps the clearest case of pop-out occurs if you listen to a vocoded sentence before and immediately after you hear the same sentence as clear speech. It is likely that the vocoded sentence will sound a lot clearer once you know the identity of that sentence.

 
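Number of bands in noise-vocoded speech

Sentence   1         2         3         4         6         8         10        15        Clear speech
A          [audio]   [audio]   [audio]   [audio]   [audio]   [audio]   [audio]   [audio]   [audio]
B          [audio]   [audio]   [audio]   [audio]   [audio]   [audio]   [audio]   [audio]   [audio]
C          [audio]   [audio]   [audio]   [audio]   [audio]   [audio]   [audio]   [audio]   [audio]
D          [audio]   [audio]   [audio]   [audio]   [audio]   [audio]   [audio]   [audio]   [audio]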

3) Different kinds of vocoded sentences:

One factor that affects how easily people can learn to understand vocoded sentences is whether or not the sentences are meaningful and contain real English words. We have explored this by using four different kinds of sentences. You can hear examples of these by clicking on the icons in the table below. In each case, you will hear the sentence three times: first distorted, then as clear speech, then distorted once more (a "DCD" recording).

Sentence Type / Example / DCD Recording

Normal Prose
    Example: it was a sunny day and the children were going to the park
    DCD Recording: [audio]

Syntactic Prose (with randomly chosen content words)
    Example: it was a gloomy hand and the women were getting to the inch
    DCD Recording: [audio]

Jabberwocky (with function words, but no familiar content words)
    Example: it was a nussy rit and the rentshil were duafing to the tand
    DCD Recording: [audio]

Nonword Sentences (without any familiar English words)
    Example: ut fode oo nussy rit ef su rentshil nu duaft ge fe tand
    DCD Recording: [audio]

In an experiment we played people 20 vocoded sentences of one of these four types, repeating each sentence three times. We then tested their comprehension of normal sentences. Listeners who heard sentences containing real English words (normal and syntactic prose) were better able to understand vocoded speech than those trained with Jabberwocky or nonword sentences. Listeners trained with nonword sentences were no better than listeners who had never heard vocoded speech before.

4) Making your own noise-vocoded speech:

All of the distorted speech on this page was created using the excellent software package Praat, created by Paul Boersma and David Weenink. The scripts that create noise-vocoded speech were modified from an original script written by Chris Darwin. Thanks to Paul and Chris for their help and support in using Praat. I would be happy to supply the Praat scripts that were used for making the examples on this page; please email me for more information.

5) References:

Greenwood, D. D. (1990). A cochlear frequency-position function for several species - 29 years later. Journal of the Acoustical Society of America, 87(6), 2592-2605.

Shannon, R. V., Zeng, F. G., Kamath, V., Wygonski, J., & Ekelid, M. (1995). Speech recognition with primarily temporal cues. Science, 270, 303-304.


This page was created on 11th December 2003, last modified 17th October 2006. Comments and suggestions to matt.davis@mrc-cbu.cam.ac.uk.
