Phoneme representation and classification in primary auditory cortex
10.1121/1.2816572
http://aip.metastore.ingenta.com/content/asa/journal/jasa/123/2/10.1121/1.2816572

Figures

FIG. 1.

Neuronal responses to phonemes in continuous speech. (A) The spectrograms of all / / vowel exemplars were extracted and averaged to obtain one grand average auditory spectrogram (bottom left). In this and the following average spectrogram plots, red areas indicate regions of higher than average energy and blue areas indicate regions of weaker than average energy. The corresponding PSTH response to / / was computed by averaging neural spike rates over the same time windows (bottom right). (B) The spectro-temporal receptive field (STRF) of a neuron as measured by normalized reverse correlation. Red areas indicate stimulus frequencies and time lags correlated with an increased response, and blue areas indicate stimulus features correlated with a decreased response. The neuron’s best frequency (BF) was defined as the excitatory peak of the STRF (red arrow). The modulation transfer function (MTF) is computed by taking the absolute value of the 2D Fourier transform of the STRF. Collapsing the MTF along the temporal or spectral dimension (also known as the rate and scale dimensions) yields the purely spectral (sMTF) or temporal (tMTF) modulation transfer function. The best scale (proportional to the inverse of bandwidth) of an STRF was defined as the centroid of the sMTF (in cycles/octave), whereas the “speed” or best rate of the STRF was defined as the centroid of the tMTF (in Hz). The choice of the centroid for the best-scale parameter results in a compressed range, but it does not affect the ordering of neurons along this dimension. (C) Average auditory spectrograms of three phonemes (/ɔ/, /ʃ/, /m/). Below each spectrogram is the PSTH response of five example neurons (labeled N1–N5). (D) The STRFs of these neurons indicate a diversity of spectro-temporal tuning properties.
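The MTF construction described in (B) can be sketched in a few lines. This is a minimal illustration rather than the authors' analysis code: the function name `mtf_centroids`, the array layout (frequency channels by time lags), the sampling parameters `dt_s` and `df_oct`, and the restriction of each centroid to non-negative modulation frequencies are all assumptions made for the example.

```python
import numpy as np

def mtf_centroids(strf, dt_s, df_oct):
    """Return (best_scale, best_rate) for an STRF.

    strf   : 2D array, shape (n_freq_channels, n_time_lags)  [assumed layout]
    dt_s   : time-lag step in seconds                        [assumed parameter]
    df_oct : frequency-channel spacing in octaves            [assumed parameter]
    """
    n_freq, n_lag = strf.shape

    # MTF: absolute value of the 2D Fourier transform of the STRF.
    mtf = np.abs(np.fft.fft2(strf))

    # Modulation axes: spectral (cycles/octave) and temporal (Hz).
    scales = np.fft.fftfreq(n_freq, d=df_oct)
    rates = np.fft.fftfreq(n_lag, d=dt_s)

    # Collapse along one dimension to get the purely spectral / temporal MTFs.
    smtf = mtf.sum(axis=1)  # function of scale
    tmtf = mtf.sum(axis=0)  # function of rate

    # Assumption: centroids are taken over non-negative modulation frequencies.
    s_keep = scales >= 0
    r_keep = rates >= 0
    best_scale = np.sum(scales[s_keep] * smtf[s_keep]) / np.sum(smtf[s_keep])
    best_rate = np.sum(rates[r_keep] * tmtf[r_keep]) / np.sum(tmtf[r_keep])
    return best_scale, best_rate
```

For an idealized STRF consisting of a single ripple, the two centroids fall on that ripple's spectral density and temporal velocity; measured STRFs have broader MTFs, so the centroids land between the extremes of the modulation axes.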

FIG. 2.

Population response to vowels. (A) I. Average auditory spectrograms of 12 vowels organized approximately according to their open-closed and front-back articulatory features. The arrows at top indicate the of these features, with arrow “tips” representing minima (mid or central) and midpoints representing maxima. For example, /ʌ/ is maximally open but is neutral (central) on the front/back axis. Note also that the axes are presumed to loop around the page from right to left (dashed ends joining), creating a circular representation. (II, III, IV) Average PSTH responses of 90 neurons to each vowel. Within each heat map, each row indicates the average response of a single neuron to the corresponding phoneme. Red regions indicate strong responses, and blue regions indicate weak responses. The average PSTH responses are sorted by the neurons’ best frequency (II), best scale (III), and best rate (IV) to emphasize the role of that parameter in the encoding of each vowel. (Details of the analysis and generation of these plots are given in Sec. II.) (B) I. Each vowel is plotted at the centroid frequency, rate, and scale of its average neuronal population response. The centroid values are calculated from the average PSTH responses sorted by the corresponding parameter [Fig. 2(A)]. “Open” vowels are shown in red, “Closed” vowels in blue, “Front” vowels with font and “Back” vowels with . To visualize the contribution of each tuning property to vowel discrimination, the location of each vowel is also shown collapsed in 2D plots of (II) scale-rate, (III) rate-BF, and (IV) scale-BF. All other details of the analysis and generation of these plots are given in Sec. II (Experimental Procedures).
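One plausible reading of the centroid in (B) is a response-weighted mean of a tuning parameter across the population. The sketch below makes that assumption explicit; the exact procedure is given in Sec. II of the article, which is not reproduced here, and the name `response_centroid` and the use of each neuron's mean response as its weight are hypothetical.

```python
def response_centroid(param_values, responses):
    """Response-weighted mean of one tuning parameter across neurons.

    param_values : one value per neuron (e.g., best frequency, rate, or scale)
    responses    : one non-negative mean response per neuron to a phoneme
    Assumption: the centroid is sum(param * response) / sum(response).
    """
    total = sum(responses)
    if total == 0:
        raise ValueError("population response is zero; centroid undefined")
    return sum(p * r for p, r in zip(param_values, responses)) / total
```

Applying the function once per parameter (frequency, rate, scale) gives the three coordinates at which a phoneme would be plotted.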

FIG. 3.

Population response to consonants. (A) I. Average spectrograms of 15 consonant phonemes grouped as six plosives, six fricatives, and three nasals. Each of the plosive and fricative groups contains three voiced and three unvoiced phonemes (see arrows at top). (II, III, IV) Average PSTH responses of the neural population to each consonant, plotted as in Fig. 2(A). The average PSTH responses are sorted by the neurons’ best frequency (III), best scale (II), and best rate (IV) to emphasize the role of that parameter in the encoding of consonants. (All other details of the analysis and generation of these plots are given in Sec. II.) (B) Each consonant is placed at the centroid frequency, rate, and scale of its neuronal population response, measured from the corresponding PSTH responses (A). Plosive phonemes are plotted in red, fricatives in blue, and nasals in green. The locus of each consonant is also shown collapsed in 2D plots of (II) scale-rate, (III) rate-BF, and (IV) scale-BF. All other details of the analysis and generation of these plots are given in Sec. II (Experimental Procedures).

FIG. 4.

Phoneme classification based on the population response. Classification masks for three unvoiced plosives (/p/, /t/, /k/) and three unvoiced fricatives (/f/, /s/, /ʃ/), sorted by the neurons’ best frequency (A), best scale (B), and best rate (C). Gray scale indicates the importance of the presence ( regions) or absence ( regions) of neural response for the classification of that phoneme. The output of each phoneme classifier is a scalar, computed as the sum of the population PSTH multiplied element by element by the mask. Because this sum runs over all neurons, the order in which the neurons are sorted in the mask and PSTH does not affect the output of the classifier.
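The scalar output described in this caption is an element-wise product summed over neurons and time bins, which is why sorting the rows has no effect as long as mask and PSTH are sorted together. The sketch below illustrates that point; the function names are hypothetical, and choosing the phoneme with the largest output is an assumed decision rule, not one stated in the caption.

```python
def classifier_output(psth, mask):
    """Sum of the element-wise product of a population PSTH and a mask.

    psth, mask : lists of rows (one row per neuron, one entry per time bin)
    """
    return sum(p * m
               for psth_row, mask_row in zip(psth, mask)
               for p, m in zip(psth_row, mask_row))

def classify(psth, masks):
    """Return the label whose mask yields the largest output.

    Assumption: an argmax decision across per-phoneme classifiers.
    """
    return max(masks, key=lambda label: classifier_output(psth, masks[label]))

# Toy example with two neurons and three time bins (made-up numbers).
psth = [[1, 0, 2],
        [0, 3, 1]]
masks = {
    "p": [[1, -1, 0], [0, 1, 0]],
    "t": [[-1, 0, 1], [1, 0, -1]],
}
# Reordering neurons identically in PSTH and mask leaves the output unchanged.
same = classifier_output(psth[::-1], masks["p"][::-1]) == classifier_output(psth, masks["p"])
```

Positive mask entries reward a response at that neuron and time bin, and negative entries penalize it, matching the caption's presence/absence interpretation.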

FIG. 5.

Neural and human phoneme confusions, and acoustic similarity between phonemes. Consonant confusion matrices from neural phoneme classifiers (left panels) and human psychoacoustic studies (Ref. 17) (middle panels). Gray scale indicates the probability of reporting a particular phoneme (column) for an input phoneme (row). (Right panels) The acoustic similarity between phoneme pairs, defined as the Euclidean distance between their average auditory spectrograms. (A) Confusion matrices and phonemic distances for unvoiced consonants. Dashed lines separate the plosives /p/, /t/, /k/ from the fricatives /f/, /s/, /ʃ/. (B) Confusion matrices and phonemic distances for voiced consonants. Dashed lines separate the plosives /b/, /d/, /ɡ/ from the fricatives /v/, /ð/, /z/, and the nasal consonants /m/ and /n/ from the rest.
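The acoustic distance in the right panels is a Euclidean distance between two average spectrograms. A minimal sketch follows, assuming both spectrograms are sampled on the same frequency-by-time grid; the function name is hypothetical, and any normalization the authors may have applied is not represented.

```python
import math

def spectrogram_distance(spec_a, spec_b):
    """Euclidean distance between two equally sized spectrograms.

    spec_a, spec_b : lists of rows (frequency channels x time bins).
    Assumption: no normalization or time alignment beyond a shared grid.
    """
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(spec_a, spec_b)
                         for a, b in zip(row_a, row_b)))
```

A smaller distance means two phonemes are acoustically more alike, which is the quantity the caption compares against the neural and human confusion patterns.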

FIG. 6.

Joint distribution of neural parameters. Joint distributions of (A) best frequency and best rate, (B) best frequency and best scale, and (C) best rate and best scale for the 90 neurons.

FIG. 7.

Phoneme confusions from 90 neurons. (Left column) Consonant confusion matrices from neural phoneme classifiers using the entire population of 90 neurons and of speech. (Right column) Consonant confusion matrices from human psychoacoustic studies. Gray scale indicates the probability of reporting a particular phoneme (column) for an input phoneme (row). (a) Confusion matrices for unvoiced consonants. Dashed lines separate the plosives /p/, /t/, /k/ from the fricatives /f/, /s/, /ʃ/. (b) Confusion matrices for voiced consonants. Dashed lines separate the plosives /b/, /d/, /ɡ/ from the fricatives /v/, /ð/, /z/, and the nasal consonants /m/ and /n/ from the rest.

FIG. 8.

Dependence of phoneme classification accuracy on the number of neurons. Classification accuracy as a function of the number of neurons used by the classifier. The dashed line indicates chance performance (1/14, or about 7%, for 14 phonemes; see Sec. II for details).

2008-02-01
2014-04-21