Neuronal responses to phonemes in continuous speech. (A) The spectrograms of all / / vowel exemplars were extracted and averaged to obtain one grand average auditory spectrogram (bottom left). In this and the following average spectrogram plots, red areas indicate regions of higher than average energy and blue areas indicate regions of weaker than average energy. The corresponding PSTH response to / / was computed by averaging neural spike rates over the same time windows (bottom right). (B) The spectro-temporal receptive field (STRF) of a neuron as measured by normalized reverse correlation. Red areas indicate stimulus frequencies and time lags correlated with an increased response, and blue areas indicate stimulus features correlated with a decreased response. The neuron’s best frequency (BF) was defined as the excitatory peak of the STRF (red arrow). The modulation transfer function (MTF) is computed by taking the absolute value of the 2D Fourier transform of the STRF. Collapsing the MTF along the temporal or the spectral dimension (also known as rate and scale) yields the purely spectral (sMTF) or the purely temporal (tMTF) modulation transfer function, respectively. The best scale (proportional to the inverse of bandwidth) of an STRF was defined as the centroid of the sMTF (in cycles/octave), whereas the “speed” or best rate of the STRF was defined as the centroid of the tMTF (in Hz). Using the centroid for the best-scale parameter compresses its range but does not affect the ordering of neurons along this dimension. (C) Average auditory spectrograms of three phonemes (/ɔ/, /ʃ/, /m/). Below each spectrogram is the PSTH response of five example neurons (labeled N1–N5). (D) The STRFs of these neurons indicate a diversity of spectro-temporal tuning properties.
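The MTF steps described in panel (B) can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' code: the function name, the argument names `dt` and `doct`, collapsing by summation, and taking the centroid over unsigned modulation frequencies are all assumptions made for the example.

```python
import numpy as np

def modulation_transfer_functions(strf, dt, doct):
    """Hypothetical helper. strf: 2D array (frequency channels x time lags);
    dt: time-lag step in seconds; doct: channel spacing in octaves."""
    # MTF: absolute value of the 2D Fourier transform of the STRF.
    mtf = np.abs(np.fft.fft2(strf))
    # Axis 0 (frequency) maps to scale in cycles/octave,
    # axis 1 (time lag) maps to rate in Hz.
    scales = np.fft.fftfreq(strf.shape[0], d=doct)
    rates = np.fft.fftfreq(strf.shape[1], d=dt)
    # Collapse along time to get the sMTF, along frequency to get the tMTF
    # (summation is an assumption; a mean gives the same centroid).
    smtf = mtf.sum(axis=1)
    tmtf = mtf.sum(axis=0)
    # Centroids taken over unsigned modulation frequencies (assumption).
    best_scale = np.sum(np.abs(scales) * smtf) / np.sum(smtf)
    best_rate = np.sum(np.abs(rates) * tmtf) / np.sum(tmtf)
    return best_scale, best_rate, smtf, tmtf
```

Because a centroid averages over the whole transfer function rather than picking its peak, estimates for broadly tuned neurons are pulled toward the middle of the axis, which is consistent with the compressed range noted in the caption.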
Population response to vowels. (A) I. Average auditory spectrograms of 12 vowels organized approximately according to their open-closed and front-back articulatory features. The arrows at top indicate the degree of these features, with arrow “tips” representing minima (mid or central) and midpoints representing maxima. For example, /ʌ/ is maximally open but is neutral (central) on the front/back axis. Note also that the axes are presumed to loop around the page from right to left (dashed ends joining), creating a circular representation. (II, III, IV) Average PSTH responses of 90 neurons to each vowel. Within each heat map, each row indicates the average response of a single neuron to the corresponding phoneme. Red regions indicate strong responses, and blue regions indicate weak responses. The average PSTH responses are sorted by neurons’ best frequency (II), best scale (III) and best rate (IV) to emphasize the role of that parameter in the encoding of each vowel. (B) I. Each vowel is plotted at the centroid frequency, rate and scale of its average neuronal population response. The centroid values are calculated from the average PSTH responses sorted by the corresponding parameter (panel A). “Open” vowels are shown in red, “Closed” vowels in blue, “Front” vowels in a light font and “Back” vowels in a dark font. To visualize the contribution of each tuning property to vowel discrimination, the location of each vowel is also shown collapsed in 2D plots of (II) scale-rate, (III) rate-BF and (IV) scale-BF. All other details of the analysis and generation of these plots are given in Sec. II (Experimental Procedures).
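One plausible reading of the centroid in panel (B) is a response-weighted mean of a tuning parameter across neurons. The sketch below assumes that reading, with each neuron's PSTH reduced to a single non-negative mean rate; the exact computation is the one given in Sec. II, and the function and argument names here are hypothetical.

```python
def population_centroid(responses, tuning_values):
    """Response-weighted mean of one tuning parameter (BF, rate or scale).

    responses: non-negative mean response of each neuron to one phoneme.
    tuning_values: that parameter's value for each neuron, in the same order.
    """
    total = sum(responses)
    if total <= 0:
        raise ValueError("responses must have a positive sum")
    weighted = sum(r * v for r, v in zip(responses, tuning_values))
    return weighted / total
```

Applying the same function three times, once each with the neurons' best frequencies, best rates and best scales, would give one point per phoneme in the 3D space of panel (B).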
Population response to consonants. (A) I. Average spectrograms of 15 consonant phonemes grouped as six plosives, six fricatives and three nasals. The plosive and fricative groups each contain three voiced and three unvoiced phonemes (see arrows at top). (II, III, IV) Average PSTH responses of the neural population to each consonant, plotted as in Fig. 2(A). The average PSTH responses are sorted by neurons’ best frequency (II), best scale (III) and best rate (IV) to emphasize the role of that parameter in the encoding of consonants. (B) Each consonant is placed at the centroid frequency, rate and scale of its neuronal population response, measured from the corresponding PSTH responses (A). Plosive phonemes are plotted in red, fricatives in blue and nasals in green. The locus of each consonant is also shown collapsed in 2D plots of (II) scale-rate, (III) rate-BF and (IV) scale-BF. All other details of the analysis and generation of these plots are given in Sec. II (Experimental Procedures).
Phoneme classification based on the population response. Classification masks for three unvoiced plosives (/p/, /t/, /k/) and three unvoiced fricatives (/f/, /s/, /ʃ/) sorted by neurons’ best frequency (A), best scale (B) and best rate (C). Gray scale indicates the importance of the presence (black regions) or absence (white regions) of neural response for the classification of that phoneme. The output of each phoneme classifier is a scalar, computed as the sum of the elementwise product of the population PSTH and the mask. Thus the order in which neurons are sorted does not affect the output of the classifier, provided the mask and the PSTH use the same order.
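The order-invariance claim follows directly from the classifier output being a sum of products. A minimal sketch with made-up numbers (the function name, the values and the permutation are purely illustrative):

```python
def classifier_output(psth, mask):
    """Sum of elementwise products over neurons (rows) and time bins (columns)."""
    return sum(p * m
               for psth_row, mask_row in zip(psth, mask)
               for p, m in zip(psth_row, mask_row))

# Toy population: 3 neurons x 2 time bins (illustrative values only).
psth = [[1.0, 2.0], [0.0, 3.0], [4.0, 1.0]]
mask = [[0.5, -1.0], [2.0, 0.0], [-0.5, 1.0]]

# Re-sort the neurons, applying the same order to the PSTH and the mask.
order = [2, 0, 1]
original = classifier_output(psth, mask)
reordered = classifier_output([psth[i] for i in order],
                              [mask[i] for i in order])
```

Here `original` and `reordered` are equal, so sorting by best frequency, scale or rate changes only how the masks look in panels (A) to (C), not what the classifier computes.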
Neural and human phoneme confusions, and phoneme acoustic similarity. Consonant confusion matrices from neural phoneme classifiers (left panels) and human psychoacoustic studies (Ref. 17) (middle panels). Gray scale indicates the probability of reporting a particular phoneme (column) for an input phoneme (row). (Right panels) The acoustic similarity between phoneme pairs, defined as the Euclidean distance between their average auditory spectrograms. (A) Confusion matrices and phonemic distances for unvoiced consonants. Dashed lines separate the plosives /p/, /t/, /k/ from the fricatives /f/, /s/, /ʃ/. (B) Confusion matrices and phonemic distances for voiced consonants. Dashed lines separate the plosives /b/, /d/, /ɡ/ from the fricatives /v/, /ð/, /z/ and the nasal consonants /m/ and /n/ from the rest.
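The acoustic measure in the right panels can be sketched as a pairwise Euclidean distance over flattened average spectrograms. This sketch assumes all average spectrograms share the same shape; the function name and the array layout are hypothetical.

```python
import numpy as np

def acoustic_distance_matrix(avg_spectrograms):
    """Pairwise Euclidean distances between average spectrograms.

    avg_spectrograms: array-like of shape (n_phonemes, n_freq, n_time).
    Returns a symmetric (n_phonemes, n_phonemes) array with a zero diagonal.
    """
    specs = np.asarray(avg_spectrograms, dtype=float)
    flat = specs.reshape(specs.shape[0], -1)      # one row vector per phoneme
    diff = flat[:, None, :] - flat[None, :, :]    # all pairwise differences
    return np.sqrt((diff ** 2).sum(axis=-1))
```

A small distance means two phonemes are acoustically similar, which is the sense in which this matrix is compared with the confusion matrices in the left and middle panels.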
Joint distribution of neural parameters. Joint distributions of (A) best frequency and best rate, (B) best frequency and best scale, and (C) best rate and best scale for the 90 neurons.
Phoneme confusions from 90 neurons. (Left column) Consonant confusion matrices from neural phoneme classifiers using the entire population of 90 neurons and of speech. (Right column) Consonant confusion matrices from human psychoacoustic studies. Gray scale indicates the probability of reporting a particular phoneme (column) for an input phoneme (row). (a) Confusion matrices for unvoiced consonants. Dashed lines separate the plosives /p/, /t/, /k/ from the fricatives /f/, /s/, /ʃ/. (b) Confusion matrices for voiced consonants. Dashed lines separate the plosives /b/, /d/, /g/ from the fricatives /v/, /ð/, /z/ and the nasal consonants /m/ and /n/ from the rest.
Dependence of phoneme classification accuracy on the number of neurons. Classification accuracy as a function of the number of neurons used by the classifier. The dashed line indicates chance performance (1/14, or about 7%, for 14 phonemes) (see Sec. II for details).