A method to identify noise-robust perceptual features: Application for consonant /t/
(Color online) AI-gram block diagram. Derived from the work of French and Steinberg (1947) and Fletcher (1995), the output of the AI gram relates to the audibility of a sound. When a speech signal is visible in different degrees of black on the AI gram, it is above the masked threshold (i.e., audible). It follows that all noise and inaudible sounds having a SNR less than appear in white due to the band normalization by the noise.
AI gram of male speaker 111 speaking /tɑ/ in (a) SWN at SNR and (b) WN at SNR. The audible speech information is dark, with the different gray levels representing the degrees of audibility. Since the two noises have different spectra, the speech is masked differently. Speech-weighted noise masks low frequencies and high frequencies equally on average. One may clearly see the strong masking due to WN at high frequencies. The AI gram is an important tool used here to explain the differences in CPs observed in many studies and to connect the physical and perceptual domains.
(Color online) Confusion patterns for /tɑ/ spoken by female talker 105 in (a) SWN and (b) white noise. Note the significant robustness difference depending on the noise spectrum. In SWN, /t/ is correctly identified down to SNR, whereas it starts to decrease at in WN. corresponds to the SNR at which the error starts to increase. The confusions are also more significant in WN, with the scores for /p/ and /k/ overcoming that of /t/ below . We call this surprising observation morphing. The maximum confusion score is denoted . The reasons for this robustness difference depend on the audibility of the /t/ event, which will be analyzed in the next section.
(Color online) Comparison between a “weak” (top, m117te) and a “strong” (robust) (bottom, m112te) /tε/. The arrangement of the four panes is optimized for inner subfigure comparisons. Step 1 corresponds to the CPs (bottom right), step 2 to the AI gram at SNR in SWN, step 3 to the mean AI above where the local maximum in the burst is identified, leading to step 4, the event gram (vertical slice through AI grams at ). Note that for the same SWN masking noise, these utterances behave differently and present different competitors. Utterance m117te strongly morphs to /pε/. Many of these differences can be explained by the AI gram and more specifically by the event gram, which quantifies the /t/-burst threshold, and therefore its robustness to noise. This threshold is precisely correlated with the human responses (encircled). This leads to the conclusion that this across-frequency onset transient is the primary /t/ event. (a) Analysis of sound /tε/ spoken by male talker 117 in SWN. This utterance is not robust to noise since the /t/ recognition starts to decrease at SNR. Identifying , time of the burst maximum at SNR in the AI gram (top left), and its mean in the range (bottom left), leads to the event gram (top right). The vertical dashed line on the AI gram shows . On the event gram, the dashed line shows the SNR at which the AI gram was displayed (similar to a vertical slice). In both cases, the horizontal dashed line marks the lower frequency limit of the burst (here ). This representation of the audible phone /t/ burst information at time is highly correlated with and the CPs: when the burst information becomes inaudible (white on the AI gram), the /t/ score decreases, as indicated by the ellipses. (b) Analysis of sound /tε/ spoken by male talker 112 in SWN. Unlike the case of m117te, this utterance is robust to SWN and identified down to SNR. Again, the burst threshold defined by the event gram (top right) is related to defined by the CP, accounting for the robustness of consonant /t/.
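The four-step construction described in this caption can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: each AI gram is assumed to be a 2-D array of shape (bands, time frames), the burst time is taken as the global maximum of the mean high-frequency AI (the caption speaks of a local maximum within the burst), and the names `burst_time_index`, `event_gram`, `band_freqs_hz`, and `f_min_hz` are hypothetical.

```python
import numpy as np

def burst_time_index(ai_gram, band_freqs_hz, f_min_hz):
    """Frame index at which the mean AI over bands >= f_min_hz is largest.

    Assumption: a global argmax stands in for the local burst maximum.
    ai_gram has shape (n_bands, n_frames).
    """
    ai = np.asarray(ai_gram, dtype=float)
    high = np.asarray(band_freqs_hz, dtype=float) >= f_min_hz
    if not high.any():
        raise ValueError("no band at or above f_min_hz")
    mean_ai = ai[high].mean(axis=0)  # mean AI per frame, high bands only
    return int(np.argmax(mean_ai))

def event_gram(ai_grams, frame):
    """Vertical slice at `frame` through the AI grams computed at each SNR.

    ai_grams has shape (n_snrs, n_bands, n_frames);
    the result has shape (n_bands, n_snrs).
    """
    stack = np.asarray(ai_grams, dtype=float)
    if stack.ndim != 3:
        raise ValueError("expected shape (n_snrs, n_bands, n_frames)")
    return stack[:, :, frame].T
```

In use, the burst frame would be located once on the AI gram at a reference SNR and the same frame index reused for every SNR, so that each column of the event gram shows the audible burst information at one noise level.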
(Color online) This variance event gram was computed from the event grams of a /tɑ/ utterance for ten different noise samples in SWN (PA07). All of the variance lies on the edges of the audible speech energy, where the noise and speech have similar levels, between regions of high audibility and regions of noise. However, the spread is thin, showing that the use of different noise samples will not significantly impact the perceptual scores.
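One way to read the variance event gram is as a cell-by-cell variance across noise realizations. The sketch below assumes that interpretation, which the caption does not spell out, and assumes the event grams are stacked in an array of shape (noise samples, bands, SNRs); the function name `variance_event_gram` is hypothetical.

```python
import numpy as np

def variance_event_gram(event_grams):
    """Per-cell variance of event grams across noise realizations.

    event_grams has shape (n_noise_samples, n_bands, n_snrs);
    the result has shape (n_bands, n_snrs).
    """
    stack = np.asarray(event_grams, dtype=float)
    if stack.ndim != 3:
        raise ValueError("expected shape (n_noise_samples, n_bands, n_snrs)")
    # Population variance over the noise-sample axis.
    return stack.var(axis=0)
```

Cells that are clearly audible or clearly masked in every realization give a variance near zero, so nonzero values concentrate where speech and noise levels are comparable, which matches the thin spread described in the caption.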
(Color online) (a) Scatter plot of the event-gram thresholds above , computed for the optimal burst bandwidth having an AI density greater than the optimal threshold compared to the SNR of 90% score. Utterances in SWN (+) are more robust than those in WN (○), accounting for the large spread in SNR. We can see that most utterances are close to the 45° line, showing the high correlation between the AI-gram audibility model (middle pane) and the event gram (right pane). The detection of the event-gram threshold is shown on the event gram in SWN [top pane of (b)] and WN [top pane of (c)], between the two horizontal lines, for f106ta, and placed above their corresponding CPs. is located at the lowest SNR where there is continuous energy above and spread in frequency with a width of above AI threshold . We can notice the effect of the noise spectrum on the event gram, accounting for the difference in robustness between WN and SWN.
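The threshold detection described in this caption can be illustrated as follows. This is a sketch of one possible reading, not the published procedure: the lower frequency limit, the required width, and the AI threshold are left as parameters because their values are not given here; the width is expressed as a count of contiguous bands (an assumption); bands are assumed to be ordered by increasing frequency; and the names `longest_run` and `event_gram_threshold` are hypothetical.

```python
import numpy as np

def longest_run(mask):
    """Length of the longest run of consecutive True values in a 1-D mask."""
    best = run = 0
    for flag in mask:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best

def event_gram_threshold(event_gram, band_freqs_hz, snrs_db,
                         f_min_hz, min_bands, ai_threshold):
    """Lowest SNR at which the burst is still audible under this reading.

    A column (one SNR) passes if at least `min_bands` contiguous bands at or
    above `f_min_hz` exceed `ai_threshold`. Returns None if no column passes.
    event_gram has shape (n_bands, n_snrs).
    """
    eg = np.asarray(event_gram, dtype=float)
    in_range = np.asarray(band_freqs_hz, dtype=float) >= f_min_hz
    passing = []
    for j, snr in enumerate(snrs_db):
        audible = (eg[:, j] > ai_threshold) & in_range
        if longest_run(audible) >= min_bands:
            passing.append(float(snr))
    return min(passing) if passing else None
```

Comparing the value returned for each utterance with the SNR at which its recognition score drops to 90% would give one point of a scatter plot like the one in panel (a).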
(Color online) Group 1 utterances are defined as those which morph as . For each panel, the top plot represents responses at SNR and the lower those at SNR. There is no significant SNR effect for these sounds. (a) Truncation of f105ta at 12 (top) and SNRs (bottom). (b) Truncation of f109ta at 12 (top) and SNRs (bottom). (c) Truncation of f119ta at 12 (top) and SNRs (bottom). (d) Truncation of m111ta at 12 (top) and SNRs (bottom).
(Color online) Utterances of group 2: Consonant /h/ strongly competes with /p/ (top), along with /k/ (bottom). For the top right and left panels: increasing the noise to SNR causes an increase in the /h/ confusion in the /p/ morph range. For the two bottom utterances, decreasing the SNR causes a /k/ confusion that was nonexistent at , equating the scores for competitors /k/ and /h/. (a) Truncation of m102ta at 12 (top) and SNRs (bottom). (b) Truncation of m104ta at 12 (top) and SNRs (bottom). (c) Truncation of m107ta at 12 (top) and SNRs (bottom). (d) Truncation of m117ta at 12 (top) and SNRs (bottom).
(Color online) Truncation of f113ta at 12 (top) and SNRs (bottom): Consonant /t/ morphs to /p/, which is slightly confused with /h/. There is no significant SNR effect.
(Color online) AI grams of m120ta, zoomed to a duration of , in the consonant and transition regions at (a) SNR and (b) SNR. Below each AI gram are plotted the listener responses as a function of the truncation time, time synchronized. Uniquely for this utterance, the /t/ identification is still high after of truncation, presumably because of the long-duration residual high-frequency (i.e., ) energy. The target probability even overcomes the score for /p/ at SNR at a truncation time of , most likely because of a strong relative /p/ event present at but weaker at .