Volume 130, Issue 4, October 2011
Index of content:
- SPEECH PERCEPTION 
130(2011); http://dx.doi.org/10.1121/1.3631668
For a mixture of target speech and noise in anechoic conditions, the ideal binary mask is defined as follows: It selects the time-frequency units where target energy exceeds noise energy by a certain local threshold and cancels the other units. In this study, the definition of the ideal binary mask is extended to reverberant conditions. Given the division between early and late reflections in terms of speech intelligibility, three ideal binary masks can be defined: an ideal binary mask that uses the direct path of the target as the desired signal, an ideal binary mask that uses the direct path and early reflections of the target as the desired signal, and an ideal binary mask that uses the reverberant target as the desired signal. The effects of these ideal binary mask definitions on speech intelligibility are compared across two types of interference: speech-shaped noise and concurrent female speech. As suggested by psychoacoustical studies, the ideal binary mask based on the direct path and early reflections of target speech outperforms the other masks as reverberation time increases and produces substantial reductions in terms of speech reception threshold for normal-hearing listeners.
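The mask definition in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `ideal_binary_mask`, the parameter `lc_db` (the local threshold in dB), and the toy energy grids are all assumptions made for illustration. The three reverberant variants would differ only in what is passed as `desired_energy` (direct path, direct path plus early reflections, or the full reverberant target); what counts as the interference term in each case is not stated in the abstract and is left to the caller.

```python
import math


def ideal_binary_mask(desired_energy, interference_energy, lc_db=0.0):
    """Return a 0/1 mask over a time-frequency grid (list of rows).

    A unit is kept (1) when the local ratio of desired to interference
    energy, in dB, exceeds lc_db; otherwise it is cancelled (0).
    All names here are hypothetical, chosen for this sketch.
    """
    mask = []
    for d_row, n_row in zip(desired_energy, interference_energy):
        row = []
        for d, n in zip(d_row, n_row):
            if d <= 0.0:
                keep = False          # no desired energy: cancel the unit
            elif n <= 0.0:
                keep = True           # no interference: keep the unit
            else:
                keep = 10.0 * math.log10(d / n) > lc_db
            row.append(1 if keep else 0)
        mask.append(row)
    return mask


# Toy 2 x 3 grid of unit energies (made-up values, not data from the study).
desired = [[4.0, 1.0, 0.5],
           [2.0, 8.0, 0.1]]
interference = [[1.0, 1.0, 1.0],
                [4.0, 1.0, 0.1]]

mask = ideal_binary_mask(desired, interference, lc_db=0.0)
```

With a 0 dB threshold, only units whose desired energy strictly exceeds the interference energy are retained; raising `lc_db` makes the mask sparser.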
The dynamic range of useful temporal fine structure cues for speech in the presence of a competing talker
130(2011); http://dx.doi.org/10.1121/1.3625237
Within an auditory channel, the speech waveform contains both temporal envelope (EO) and temporal fine structure (TFS) information. Vocoder processing extracts a modified version of the temporal envelope (E′) within each channel and uses it to modulate a channel carrier. The resulting signal, E′Carr, has reduced information content compared to the original “EO + TFS” signal. The dynamic range over which listeners make additional use of EO + TFS over E′Carr cues was investigated in a competing-speech task. The target-and-background mixture was processed using a 30-channel vocoder. In each channel, EO + TFS replaced E′Carr at either the peaks or the valleys of the signal. The replacement decision was based on comparing the short-term channel level to a parametrically varied “switching threshold,” expressed relative to the long-term channel level. Intelligibility was measured as a function of switching threshold, carrier type, target-to-background ratio, and replacement method. Scores showed a dependence on all four parameters. Derived intensity-importance functions (IIFs) showed that EO + TFS information from 8–13 dB below to 10 dB above the channel long-term level was important. When EO + TFS information was added at the peaks, IIFs peaked around −2 dB, but when EO + TFS information was added at the valleys, the peaks lay around +1 dB.
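The replacement decision described in the abstract above can be sketched roughly as follows. This is an illustrative assumption, not the procedure used in the study: the abstract does not say how the short-term level was computed, so the sketch uses RMS level over non-overlapping frames, and the names `rms_db`, `replacement_flags`, `frame_len`, and `mode` are hypothetical.

```python
import math


def rms_db(samples):
    """RMS level of a sample sequence in dB (re an amplitude of 1.0)."""
    power = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(power) if power > 0.0 else float("-inf")


def replacement_flags(channel, frame_len, threshold_db, mode="peaks"):
    """Flag the frames in which EO + TFS would replace E'Carr.

    Each frame's short-term level is expressed relative to the long-term
    level of the whole channel and compared with threshold_db.
    mode="peaks" flags frames at or above the threshold;
    mode="valleys" flags frames below it.
    Framing and tie handling are assumptions made for this sketch.
    """
    long_term = rms_db(channel)
    flags = []
    for start in range(0, len(channel) - frame_len + 1, frame_len):
        relative = rms_db(channel[start:start + frame_len]) - long_term
        above = relative >= threshold_db
        flags.append(above if mode == "peaks" else not above)
    return flags


# Toy channel signal: one loud frame followed by one quiet frame.
channel = [1.0] * 4 + [0.1] * 4

peak_flags = replacement_flags(channel, frame_len=4, threshold_db=0.0, mode="peaks")
valley_flags = replacement_flags(channel, frame_len=4, threshold_db=0.0, mode="valleys")
```

Sweeping `threshold_db` changes how much of the channel carries the original EO + TFS signal, which is the kind of parametric variation the abstract describes.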
130(2011); http://dx.doi.org/10.1121/1.3631667
Linear prediction is a widely available technique for analyzing acoustic properties of speech, although this method is known to be error-prone. New tests assessed the adequacy of linear prediction estimates by using this method to derive synthesis parameters and testing the intelligibility of the synthetic speech that results. Matched sets of sine-wave sentences were created, one set using uncorrected linear prediction estimates of natural sentences, the other using estimates made by hand. Phoneme restrictions imposed on linguistic properties allowed comparisons between continuous and intermittent voicing, oral or nasal and fricative manner, and unrestricted phonemic variation. Intelligibility tests revealed uniformly good performance with sentences created by hand-estimation and a minimal decrease in intelligibility with estimation by linear prediction due to manner variation with continuous voicing. Poorer performance was observed when linear prediction estimates were used to produce synthetic versions of phonemically unrestricted sentences, but no similar decline was observed with synthetic sentences produced by hand estimation. The results show a substantial intelligibility cost of reliance on uncorrected linear prediction estimates when phonemic variation approaches natural incidence.