The influence of stop consonants’ perceptual features on the Articulation Index model
(Color online) Error distribution of 56 /p/ utterances in the low-noise (SNR ≥ −2 dB) environment. The total number of utterances, marked above the topmost block, is 56 (14 talkers with 4 vowels). The zero-error (ZE) group is the leftmost and contains 41 of the 56 utterances, as marked above the block. The number above a block gives the size of the group, i.e., the number of utterances out of 56 that belong to that group. For the remaining 15 (56 − 41) utterances, the next level shows the number of errors made in the low-noise environment. From the figure, 11 utterances have 1 error (of 38 trials on average), forming the low error (LE) group. Four utterances (m107pe, f113pI, m112pI, and f106pI) have 3, 5, 7, and 22 errors, respectively. The first utterance (m107pe) belongs to the medium error (ME) group, and the last three have errors greater than 12% (Table I) and thus belong to the high error (HE) group.
Stacked bar plots give the relative errors for the six stop consonants in speech-weighted noise in the low-noise environment. The abscissa shows the six consonants, arranged in order of decreasing number of utterances in the ZE group (order of decreasing salience). The ordinate indicates the number of utterances of each consonant that fall into the ZE, LE, ME, and HE groups, respectively. The total is always 56. ZE is the zero-error group, containing utterances for which all listeners gave correct responses at −2 dB SNR and in quiet. LE is the low error group, having low-grade random errors. ME is the medium error group, with utterances having between 3 and 12% error. HE group utterances have errors greater than 12%, which are primarily due to production errors. These are always ambiguous/primable utterances with high errors and low entropy. The ZE and LE groups together form the robust zero error (RZE) group.
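The grouping described in these captions can be sketched as a small threshold function. This is only an illustration: the function name `classify_error_group` is hypothetical, the treatment of errors falling exactly on 3% or 12% is an assumption (the captions do not say which group a boundary value joins), and 38 trials per utterance is the average quoted for the /p/ utterances, not an exact count for each one.

```python
def classify_error_group(n_errors: int, n_trials: int) -> str:
    """Return "ZE", "LE", "ME", or "HE" for one utterance (hypothetical helper)."""
    if n_trials <= 0 or not 0 <= n_errors <= n_trials:
        raise ValueError("need n_trials > 0 and 0 <= n_errors <= n_trials")
    if n_errors == 0:
        return "ZE"  # no errors in the low-noise environment
    pct = 100.0 * n_errors / n_trials
    if pct < 3.0:
        return "LE"  # low-grade random errors
    if pct <= 12.0:
        return "ME"  # assumed: the 3% and 12% boundaries both count as ME
    return "HE"      # errors greater than 12%


# Error counts quoted for the /p/ utterances, with the average of 38 trials.
for n in (0, 1, 3, 5, 7, 22):
    print(n, classify_error_group(n, 38))
```

With 38 trials, 1 error is about 2.6% (LE), 3 errors about 7.9% (ME), and 5, 7, or 22 errors all exceed 12% (HE), which agrees with the grouping stated for m107pe, f113pI, m112pI, and f106pI.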
(Color online) Probability of error P_e(SNR) for the 56 /p/ utterances, broken down into the four error groups defined in Sec. II D. In each panel, the thick dashed curve is the grand mean [μ(SNR)] across all 56 /p/ utterances, while the thick dashed-dotted curve is the grand standard deviation [σ(SNR)], as labeled in (d). Here the quiet condition (indicated as Q) is arbitrarily assigned to 6 dB (Phatak and Allen, 2007). (a) shows the 41 ZE scores [P_e(SNR ≥ −2) = 0]. (b) shows the 3 HE sounds (P_e ≥ 12 [%]), along with their mean [μ_HE(SNR)] (thin dashed-dotted). (c) shows the 11 LE sounds (0 < P_e < 3 [%]). (d) Besides the one ME sound, also shown [solid line superimposed on the grand mean μ(SNR), thick dashed line] is the AI model error for /p/ computed from the AI error formula (lower left), with e_min = 0.035 (3.5%) and . The RMS error between μ(SNR) and the AI error formula is 0.75%. Also shown (thin dashed-dotted lines) are the means μ_ZE(SNR), μ_LE(SNR), and μ_HE(SNR) for the ZE, LE, and HE groups, respectively.
(Color online) AI-gram of f106pI at 0 dB SNR. The conflicting cue is marked by a solid box. This clearly shows a high frequency conflicting /t/ burst (Régnier and Allen, 2008; Li and Allen, 2011). The utterance is primable as either /p/ or /t/. Correspondingly the error is 56%. The time axis is labeled in centiseconds [cs] (1 cs = 10 ms). Centisecond units are naturally relevant to speech perception.
(Color online) AI-grams at 6 dB SNR. In both AI-grams, the solid box marks the /k/ feature, while the start of the vowel is marked by a solid line. The burst cue is very close to the beginning of the vowel, which is a characteristic of the /g/ feature (Li et al., 2010). This explains why these two /k/ utterances are highly confusable with /g/.
(Color online) Distribution of errors of the 56 utterances of /b/. The colors in (a) and (b) indicate the four vowels. Quiet is arbitrarily marked at 18 dB and, in (b), is joined to −2 dB by dashed lines. (a) Error vs SNR plot of the 11 ZE utterances. (b) Error vs SNR plot of the 45 NZE utterances. (c) Breakdown of the errors in the low-noise environment, based on the absolute number of errors made. Twenty-one utterances are in the RZE group. Twenty-five (44%) utterances are HE utterances.
(Color online) Log-error vs SNR for /b/ (average over 56 utterances) for the 14 listeners who completed the experiment (PA07). The grand average error over these 14 listeners is shown by a dashed line. The legend indicates each listener with a two-letter ID. In quiet, there were six listeners having greater than average error: AN, BH, LT, QN, CB, and SP. The four listeners removed from the PA07 analysis were AN, BH, LT, and QN (not CB and SP). We see from the figure that other than for quiet, QN was the best listener. For this figure Q was arbitrarily defined as 18 [dB] SNR.
(Color online) (a) Individual /p/ error curves aligned at their 50% error values. The solid line shows the average “master error curve,” which falls from 75% to 25% error over 6 dB. (b) Histogram of the shifts SNR_50 required to move each /p/ utterance onto the average (i.e., the master curve). Individual error curves are aligned at their 50% error values at −16 dB (as defined by the solid line). (c) Average log-linear error curves for the six stop consonants, with AI = 1 marked at −2 dB SNR. Log-linear regression fits have correlation coefficients of 0.990, 0.997, 0.981, 0.996, 0.998, and 0.992 for /p/, /t/, /k/, /b/, /d/, and /g/, respectively. The average of these six curves is the thick dashed line labeled μ(SNR) in Fig. 3(d). (d) Histogram of the perceptual threshold SNR_90 values for 55 /p/ utterances [utterance f106pI never reaches 100% score (i.e., SNR_90 = ∞)]. If we ignore the three outliers having high (>0) threshold values, the remaining SNR_90 values have a dynamic range of ≈ 20 dB. This approaches the AI’s 30 dB dynamic range, defined across all utterances (French and Steinberg, 1947).
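A log-linear fit of the kind mentioned in (c) can be sketched as an ordinary least-squares line through log10(error) versus SNR. This is a minimal sketch, not the authors' procedure: the function name `log_linear_fit` and the sample values are hypothetical, and the use of a base-10 logarithm is an assumption.

```python
import math


def log_linear_fit(snrs, errors):
    """Least-squares line through (SNR, log10(error)); hypothetical helper.

    Returns (slope, intercept, r), where r is the Pearson correlation
    between SNR and log10(error).
    """
    if len(snrs) != len(errors) or len(snrs) < 2:
        raise ValueError("need at least two (SNR, error) pairs")
    if any(e <= 0 for e in errors):
        raise ValueError("errors must be positive to take a logarithm")
    ys = [math.log10(e) for e in errors]
    n = len(snrs)
    mean_x = sum(snrs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in snrs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(snrs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical, exactly log-linear data (not taken from the paper).
snrs = [-22, -20, -16, -10, -2]
errors = [10 ** (-0.05 * s - 1.2) for s in snrs]
slope, intercept, r = log_linear_fit(snrs, errors)
print(slope, intercept, r)
```

Because error falls as SNR rises, `r` is negative for such a curve; its magnitude indicates how closely the points follow a straight line on a log-error axis.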
Percentage error, N, and SNR_90 values for the 15 NZE utterances of /p/ shown in Fig. 1. The table is divided into three groups by horizontal lines. The top 11 utterances have exactly 1 error (<3%) thus = 1 so we interpret these errors as random. The last three utterances (f113pI, m112pI, and f106pI) have more than 12% error and thus belong to the high error (HE) group. Utterance m107pe is the lone member of the medium error (ME) group. The SNR_90 (the SNR at which the score drops from 100% to 90%) is highly correlated with the acoustic feature threshold [Fig. 6a from Régnier and Allen (2008)] and is taken as an objective measure of the robustness of the sound. As seen from the tabulated values, ME and HE utterances have high (≥ 2 dB) SNR_90 thresholds. Thus they are easily confusable, even in the low-noise environment. In particular, f106pI has more than 50% error even in quiet, so its SNR_90 value is ∞. LE utterances have low SNR_90 values (< 2 dB) and thus are robust. Therefore they should ideally be classified with the zero error (ZE) group.
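The SNR_90 definition above (the SNR at which the score drops from 100% to 90%, i.e., the error rises to 10%) can be illustrated by interpolating between measured points. This is a sketch under stated assumptions: the function name `snr_90`, the use of linear interpolation, and the sample data are hypothetical, and the source does not say how the tabulated thresholds were actually computed.

```python
import math


def snr_90(snrs, errors, threshold=0.10):
    """Estimate the SNR (dB) at which error rises to `threshold`.

    Hypothetical helper. `errors` are fractions in [0, 1] measured at `snrs`.
    Returns math.inf if the error exceeds the threshold even at the highest
    SNR, and -math.inf if it never exceeds the threshold in the measured range.
    """
    if len(snrs) != len(errors) or not snrs:
        raise ValueError("need matching, non-empty SNR and error lists")
    pairs = sorted(zip(snrs, errors))  # ascending SNR
    if pairs[-1][1] > threshold:
        return math.inf  # score never reaches 90%
    # Walk down from the highest SNR to the first crossing of the threshold.
    for i in range(len(pairs) - 1, 0, -1):
        s_hi, e_hi = pairs[i]
        s_lo, e_lo = pairs[i - 1]
        if e_lo > threshold:
            frac = (threshold - e_hi) / (e_lo - e_hi)
            return s_hi - frac * (s_hi - s_lo)
    return -math.inf


# Hypothetical error curves (not taken from the paper).
grid = [-22, -20, -16, -10, -2]
robust = snr_90(grid, [0.90, 0.70, 0.30, 0.05, 0.00])
ambiguous = snr_90(grid, [0.95, 0.90, 0.80, 0.70, 0.60])
print(robust, ambiguous)
```

In this sketch an utterance whose error stays above 10% at the highest measured SNR gets an infinite threshold, mirroring the ∞ entry described for f106pI.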
Percentage error, N, and SNR_90 values for NZE utterances of /t/. Ten utterances in the topmost block with a single error (effectively less than 3% error) belong to the LE group, the next four in the middle block are ME utterances, while m117te and f103te are HE ambiguous utterances. The HE utterances have high SNR_90 thresholds, as seen in the table.
Percentage error, N, and SNR_90 values for NZE utterances of /k/ in the low-noise environment. The NZE group is half that of /p/ and /t/. We interpret /k/ as having high salience, meaning it is easily articulated and easily identified (i.e., it is naturally robust). The top three utterances belong to the LE group, the next two to the ME group, and the last two are HE utterances (with high SNR_90 values).
Percentage error, N, and SNR_90 values for NZE utterances of /b/. The horizontal line is the demarcation between the 10 low error (LE) utterances (above) and the 10 medium error (ME) utterances (below). The entire right column of the table contains the HE utterances (25 of the 45 NZE utterances). Clearly, /b/ is a difficult sound compared to the other five stop consonants, because a majority of its utterances have high errors. Such high errors are likely to be due to production errors, as evidenced by the fact that one talker (f101) has no high-error utterances (only one utterance, f101bI, has a single random error). The 11 ZE sounds demonstrate that the listeners can hear a well-articulated /b/. For most HE sounds, /b/ is confused with /v/ and /f/. These HE utterances have high thresholds, and most do not reach 90% score, even in quiet.
Percentage error, N, and SNR_90 values for NZE utterances of /d/. The left four columns contain the 12 LE utterances. The horizontal line in the right four columns is the demarcation between the 13 medium error (ME) utterances (above) and the 4 high error (HE) utterances (below). The SNR_90 values are well correlated with these three groups: LE sounds have low thresholds while HE sounds have high perceptual thresholds, reaching ∞ for sounds whose score does not reach 90% even in quiet.
Percentage error, N, and SNR_90 values for NZE utterances of /g/. All 56 /g/ utterances used in the experiment are well articulated and have no high errors. The utterances in the left four columns form the LE group, while those in the right three columns belong to the ME group. All NZE utterances have SNR_90 thresholds below −2 dB SNR.