Human phoneme recognition depending on speech-intrinsic variability
Average DFT power spectrum of stationary masking noise signal (thick black line) and long-term spectra of OLLO utterances. Individual long-term spectra for central consonant and vowel phonemes are shown in the left and right panel, respectively. Mel-scaling with labels in Hz has been chosen for the frequency axis.
Phoneme recognition results (% correct) with standard errors, depending on speech-intrinsic variabilities such as speaking rate and style (Set , left panel) and dialect (Set , middle panel), and on additive masking noise (right panel). Results for Sets and were obtained in listening experiments at −6.2 dB SNR in speech-shaped noise. Variabilities are sorted by average recognition accuracies, which are broken down into consonant and vowel scores.
Average recognition rates depending on speaking variability (left panel), SNR (middle panel) and dialect or accent (right panel). The dashed horizontal lines show the difference between logatomes in the ‘normal’ condition and the average performance of the remaining variabilities. Dotted lines denote differences between the ‘no dialect’ condition and the remaining dialects. Projecting these differences onto the middle panel allows changes in speaking variability to be expressed in terms of SNR.
Left panel: Relation between the phoneme-noise distance and recognition rates for consonants and vowels. The SAMPA transcription of the corresponding phoneme is given next to each data point. Right panel: Relation between phoneme-phoneme distance and error rates obtained from symmetrized confusion matrices. For each phoneme, several data points are shown, which correspond to confusions with ‘spectral neighbors’, i.e., phonemes that were spectrally closest (marker ‘o’) and second closest (marker ‘◻’) to the presented phoneme. Data points in light gray represent data from individual listeners.
Relative information transmission depending on speaking variability such as speaking rate and effort (left panel) and dialect and accent (right panel) for selected articulatory features. The error bars denote the standard error across listeners.
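As a minimal sketch of how a relative information transmission score can be derived from a confusion matrix, the following Python example assumes the standard definition from Miller and Nicely (1955): the mutual information between presented and perceived feature values, divided by the entropy of the presented values. Whether the article uses exactly this variant is an assumption. The function names, the four-phoneme count matrix and the voicing labels are hypothetical and are not taken from the OLLO data.

```python
import numpy as np

def collapse_by_feature(counts, feature_values):
    """Pool a phoneme confusion matrix into a feature confusion matrix.

    feature_values[i] is the feature value (e.g. 'voiced') of phoneme i.
    """
    counts = np.asarray(counts, dtype=float)
    values = sorted(set(feature_values))
    index = [values.index(v) for v in feature_values]
    collapsed = np.zeros((len(values), len(values)))
    for i, fi in enumerate(index):
        for j, fj in enumerate(index):
            collapsed[fi, fj] += counts[i, j]
    return collapsed

def relative_information_transmission(counts):
    """Mutual information T(x;y) divided by the input entropy H(x).

    Assumes at least two presented classes, so that H(x) > 0.
    """
    counts = np.asarray(counts, dtype=float)
    p_xy = counts / counts.sum()      # joint probabilities
    p_x = p_xy.sum(axis=1)            # presented (row) marginals
    p_y = p_xy.sum(axis=0)            # perceived (column) marginals
    outer = np.outer(p_x, p_y)
    nz = p_xy > 0                     # skip empty cells (0 * log 0 = 0)
    transmitted = np.sum(p_xy[nz] * np.log2(p_xy[nz] / outer[nz]))
    p_x_nz = p_x[p_x > 0]
    input_entropy = -np.sum(p_x_nz * np.log2(p_x_nz))
    return float(transmitted / input_entropy)

# Hypothetical counts for /p/, /b/, /t/, /d/ (rows: presented, columns: perceived).
toy_counts = np.array([
    [40, 10,  0,  0],
    [10, 40,  0,  0],
    [ 0,  0, 40, 10],
    [ 0,  0, 10, 40],
])
toy_voicing = ["unvoiced", "voiced", "unvoiced", "voiced"]

voicing_matrix = collapse_by_feature(toy_counts, toy_voicing)
voicing_score = relative_information_transmission(voicing_matrix)
```

A score of 1 means the feature was transmitted without loss, while a score of 0 means the responses carry no information about the presented feature value.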
Comparison of average consonant recognition scores with results from Sroka and Braida (2005) [SB05], Phatak and Allen (2007) [PA07], Grant and Walden (1996) [GW96] and Miller and Nicely (1955) [MN55]. Filled symbols denote results obtained with the OLLO database. Recognition scores for Sets and for ‘normal’ speaking style and ‘no dialect’ condition include a single SNR and appear as single data points.
Phoneme confusion matrix obtained in an ASR experiment. In this row-normalized CM, black color denotes unity and white color corresponds to chance performance. The eight consonant phonemes that were selected to be included in the OLLO database based on this experiment are marked with arrows.
Properties of the OLLO speech database.
Subsets of the OLLO database used for human listening tests. The sets are used to analyze the influence of variabilities such as speaking rate, effort and style (Set ), dialect or accent (Set ) and SNR (Set ). Each set contains at least 150 different logatomes with 24 central phonemes. For sets with more than one speaker, the gender is equally distributed.
Phonetic features of eleven consonants. The articulatory feature ‘voicing’ can assume two feature values (voiced and unvoiced). For manner of articulation, consonants are categorized as stop, fricative or nasal. Possible values for place of articulation are anterior, medial, and posterior.
Confusion matrix for consonant phonemes obtained with Set (−6.2 dB SNR), pooled over all speaking styles, listeners and speakers in this test set. The average recognition rate is 67.7%. Rows (which denote presented phonemes) are normalized and rounded, so that each row adds up to approximately 100% (corresponding to 720 presentations).
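The row normalization described in this caption can be sketched as follows. This is an illustrative Python example under stated assumptions: the function names and the three-phoneme count matrix are hypothetical, and every presented phoneme is assumed to have at least one presentation. The average recognition rate is taken here as the mean of the diagonal of the row-normalized matrix, which equals the overall percentage correct when every phoneme is presented equally often, as the caption indicates.

```python
import numpy as np

def row_normalize(counts):
    """Convert raw response counts to percentages per presented phoneme (row)."""
    counts = np.asarray(counts, dtype=float)
    row_sums = counts.sum(axis=1, keepdims=True)
    return 100.0 * counts / row_sums

def average_recognition_rate(counts):
    """Mean of the diagonal (correct responses) of the row-normalized matrix."""
    return float(np.mean(np.diag(row_normalize(counts))))

# Hypothetical count matrix: rows are presented phonemes, columns are responses.
toy_counts = np.array([
    [80, 15,  5],
    [10, 70, 20],
    [ 5,  5, 90],
])

toy_percent = row_normalize(toy_counts)
toy_rate = average_recognition_rate(toy_counts)
```

Rounding the normalized entries, as in the published table, is why each row adds up to only approximately 100%.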
CM for vowel phonemes, obtained with Set (−6.2 dB SNR). The average recognition rate is 80.5%. Rows are normalized, with 100% corresponding to 1152 presentations. For a detailed description, see Table IV.