Human phoneme recognition depending on speech-intrinsic variability
Note: Parts of this work were presented at the Eighth Annual Conference of the International Speech Communication Association (Interspeech 2007, Antwerp).
FIG. 1.

Average DFT power spectrum of stationary masking noise signal (thick black line) and long-term spectra of OLLO utterances. Individual long-term spectra for central consonant and vowel phonemes are shown in the left and right panel, respectively. Mel-scaling with labels in Hz has been chosen for the frequency axis.
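The mel-scaled frequency axis mentioned in the caption can be reproduced with a standard Hz-to-mel conversion. A minimal sketch, assuming the common HTK-style mel formula (the caption does not state which mel variant was used):

```python
import numpy as np

# HTK-style mel scale (an assumption; other mel variants exist).
def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)

# Tick positions equidistant on the mel axis, labeled in Hz as in the figure.
mel_ticks = np.linspace(hz_to_mel(100.0), hz_to_mel(8000.0), 6)
hz_labels = mel_to_hz(mel_ticks)
```

Plotting the long-term spectra against `hz_to_mel(f)` while labeling ticks with `hz_labels` yields a mel-warped axis with readable Hz labels.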

FIG. 2.

Phoneme recognition results (% correct) with standard errors, depending on speech-intrinsic variabilities such as speaking rate and style (Set , left panel) and dialect (Set , middle panel), and on additive masking noise (right panel). Results for Sets and were obtained in listening experiments at −6.2 dB SNR in speech-shaped noise. Variabilities are sorted by average recognition accuracies, which are broken down into consonant and vowel scores.

FIG. 3.

Average recognition rates depending on speaking variability (left panel), SNR (middle panel) and dialect or accent (right panel). The dashed horizontal lines show the difference between logatomes in the ‘normal’ condition and the average performance of the remaining variabilities. Dotted lines denote differences between the ‘no dialect’ condition and the remaining dialects. By projecting these differences on the middle panel, changes in speaking variability may be expressed in terms of SNR.
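The projection described in the caption amounts to inverting the recognition-vs-SNR curve: a change in recognition rate caused by a speaking variability is mapped to the SNR change that would produce the same drop. A minimal sketch of that idea, with purely illustrative SNRs and scores (not the paper's measured values):

```python
import numpy as np

# Illustrative psychometric data: % correct measured at several SNRs.
snr_db = np.array([-12.0, -6.0, 0.0, 6.0])   # assumed measurement points
score = np.array([35.0, 62.0, 85.0, 95.0])   # assumed % correct

def score_to_snr(pct):
    # np.interp requires increasing x; scores increase with SNR here.
    return float(np.interp(pct, score, snr_db))

# e.g., a drop from a 'normal' score of 80% to 70% for a fast speaking rate
# corresponds to this equivalent SNR reduction:
delta_snr = score_to_snr(80.0) - score_to_snr(70.0)
```

This linear interpolation between measured points is one simple way to realize the projection; the paper's exact procedure may differ.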

FIG. 4.

Left panel: Relation between the phoneme-noise distance and recognition rates for consonants and vowels. Next to each data point, the SAMPA transcript of the corresponding phoneme is shown. The right panel shows the relation between phoneme-phoneme distance and error rates obtained from symmetrized confusion matrices. For each phoneme, several data points are shown which correspond to confusions with ‘spectral neighbors’, i.e., phonemes that were spectrally closest (marker ‘o’) and 2nd closest (marker ‘◻’) to the presented phoneme. Data points in light gray represent data from individual listeners.
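The ‘spectral neighbors’ in the right panel are defined by a spectral distance between phonemes. A minimal sketch, assuming each phoneme is represented by its long-term power spectrum and using an RMS log-spectral distance (the paper's exact distance measure may differ; all names here are illustrative):

```python
import numpy as np

def spectral_distance(spec_a, spec_b):
    # RMS distance between log power spectra (one plausible choice).
    a = np.log10(np.asarray(spec_a, dtype=float))
    b = np.log10(np.asarray(spec_b, dtype=float))
    return float(np.sqrt(np.mean((a - b) ** 2)))

def nearest_neighbors(spectra, target, k=2):
    # Closest and 2nd-closest phonemes to the presented one,
    # as plotted with markers 'o' and '◻'.
    d = {p: spectral_distance(s, spectra[target])
         for p, s in spectra.items() if p != target}
    return sorted(d, key=d.get)[:k]
```

With a dict mapping SAMPA labels to long-term spectra, `nearest_neighbors(spectra, '/p/')` would return the two spectrally closest competitors used for the confusion analysis.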

FIG. 5.

Relative information transmission depending on speaking variability such as speaking rate and effort (left panel) and dialect and accent (right panel) for selected articulatory features. The error bars denote the standard error across listeners.
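Relative information transmission for an articulatory feature is conventionally computed from a feature-level confusion matrix as mutual information normalized by the input entropy, following Miller and Nicely (1955). A minimal sketch of that standard measure (assumed, not taken verbatim from the paper):

```python
import numpy as np

def relative_transmission(cm):
    # cm: confusion-count matrix pooled over a feature's values
    # (rows: presented feature value, columns: responded value).
    p = np.asarray(cm, dtype=float)
    p /= p.sum()
    px = p.sum(axis=1)                       # marginal of presented values
    py = p.sum(axis=0)                       # marginal of responses
    mask = p > 0
    t = np.sum(p[mask] * np.log2(p[mask] / np.outer(px, py)[mask]))
    h_in = -np.sum(px[px > 0] * np.log2(px[px > 0]))  # input entropy
    return t / h_in                          # 1.0 = perfect, 0.0 = chance
```

A diagonal matrix yields 1.0 (the feature is perfectly transmitted); a uniform matrix yields 0.0 (responses carry no information about the feature).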

FIG. 6.

Comparison of average consonant recognition scores with results from Sroka and Braida (2005) [SB05], Phatak and Allen (2007) [PA07], Grant and Walden (1996) [GW96] and Miller and Nicely (1955) [MN55]. Filled symbols denote results obtained with the OLLO database. Recognition scores for Sets and for the ‘normal’ speaking style and ‘no dialect’ condition were measured at a single SNR and therefore appear as single data points.

FIG. 7.

Phoneme confusion matrix obtained in an ASR experiment. In this row-normalized CM, black color denotes unity and white color corresponds to chance performance. The eight consonant phonemes that were selected to be included in the OLLO database based on this experiment are marked with arrows.


TABLE I.

Properties of the OLLO speech database.

TABLE II.

Subsets of the OLLO database used for human listening tests. The sets are used to analyze the influence of variabilities such as speaking rate, effort and style (Set ), dialect or accent (Set ) and SNR (Set ). Each set contains at least 150 different logatomes with 24 central phonemes. For sets with more than one speaker, the gender is equally distributed.

TABLE III.

Phonetic features of eleven consonants. The articulatory feature ‘voicing’ can assume two feature values (voiced and unvoiced). For manner of articulation, consonants are categorized as stop, fricative or nasal. Possible values for place of articulation are anterior, medial, and posterior.

TABLE IV.

Confusion matrix for consonant phonemes obtained with Set (−6.2 dB SNR), pooled over all speaking styles, listeners and speakers in this test set. The average recognition rate is 67.7%. Rows (which denote presented phonemes) are normalized and rounded, so that each row adds up to approximately 100% (corresponding to 720 presentations).
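The row normalization described in the caption scales each row (one presented phoneme) so that its entries sum to 100%, with 100% corresponding to the number of presentations per phoneme. A minimal sketch with illustrative counts (not the paper's data):

```python
import numpy as np

def row_normalize_percent(cm_counts):
    # Convert a confusion-count matrix to row-wise percentages:
    # each row sums to 100 (before rounding for display).
    cm = np.asarray(cm_counts, dtype=float)
    return 100.0 * cm / cm.sum(axis=1, keepdims=True)

# Illustrative 3x3 counts; in the paper each row holds 720 presentations.
cm = np.array([[600,  80,  40],
               [100, 560,  60],
               [ 50,  70, 600]])
cm_pct = row_normalize_percent(cm)
```

Rounding the percentages for display explains why the printed rows add up only approximately to 100%.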

TABLE V.

CM for vowel phonemes, obtained with Set (−6.2 dB SNR). The average recognition rate is 80.5%. Rows are normalized, with 100% corresponding to 1152 presentations. For a detailed description, see Table IV.

