Index of content:
Volume 120, Issue 1, July 2006
- SPEECH PROCESSING AND COMMUNICATION SYSTEMS 
120(2006); http://dx.doi.org/10.1121/1.2205131
This paper proposes a speech feature extraction method that uses periodicity and nonperiodicity for robust automatic speech recognition. The method was motivated by the auditory comb filtering hypothesis proposed in speech perception research. It divides the input signal into subband signals and decomposes each into periodic and nonperiodic components using comb filters designed independently in each subband. Both components are used as feature parameters. This representation exploits the robustness of periodicity measurements against noise while preserving the overall speech information content. In addition, periodicity is estimated independently in each subband, which provides robustness against noise spectrum bias. The framework is similar to that of a previous study [Jackson et al., Proc. of Eurospeech (2003), pp. 2321–2324], which is based on cascade processing motivated by speech production. However, the proposed method differs in its design philosophy, which is based on parallel distributed processing motivated by speech perception. Continuous digit speech recognition experiments in the presence of noise confirmed that the proposed method performs better than conventional methods when the noise in the training and test data sets differs.
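The decomposition step described in this abstract can be illustrated with a minimal sketch for a single subband signal. It assumes a simple two-tap comb filter and an autocorrelation-based period estimate; the abstract does not specify the filter design, so the function names `estimate_period` and `comb_decompose` and the filter itself are illustrative assumptions rather than the authors' method.

```python
import math


def estimate_period(x, min_lag, max_lag):
    """Pick the lag (in samples) with the largest raw autocorrelation.

    Assumption: a plain autocorrelation peak stands in for the per-subband
    periodicity estimate mentioned in the abstract.
    """
    def autocorr(lag):
        return sum(x[n] * x[n - lag] for n in range(lag, len(x)))

    return max(range(min_lag, max_lag + 1), key=autocorr)


def comb_decompose(x, period):
    """Split x into periodic and nonperiodic parts with a two-tap comb filter.

    periodic[n]    = 0.5 * (x[n] + x[n - period])
    nonperiodic[n] = 0.5 * (x[n] - x[n - period])
    The two parts always sum back to x.
    """
    periodic, nonperiodic = [], []
    for n, sample in enumerate(x):
        # Before one full period of history exists, pass the sample through
        # unchanged (it is counted entirely as periodic).
        delayed = x[n - period] if n >= period else sample
        periodic.append(0.5 * (sample + delayed))
        nonperiodic.append(0.5 * (sample - delayed))
    return periodic, nonperiodic


# Hypothetical subband signal: a pure tone with a 20-sample period.
tone = [math.sin(2 * math.pi * n / 20) for n in range(200)]
period = estimate_period(tone, min_lag=10, max_lag=30)
periodic, nonperiodic = comb_decompose(tone, period)
```

In a full front end, this pair of functions would be applied to each subband separately, so a noise component concentrated in one band only disturbs the period estimate of that band.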
Effects of vocal loudness variation on spectrum balance as reflected by the alpha measure of long-term-average spectra of speech 120(2006); http://dx.doi.org/10.1121/1.2208451
The overall slope of the long-term-average spectrum (LTAS) decreases as vocal loudness increases. Therefore, changes in vocal loudness also affect the alpha measure, defined as the ratio of spectrum intensity above and below . The effect of loudness variation on alpha was analyzed in 15 male and 16 female voices reading a text at different degrees of vocal loudness. The mean range of equivalent sound level amounted to about and the mean range of alpha to 19.0 and for the female and male subjects. The relationship between alpha and equivalent sound level could be approximated with a quadratic function, or by a linear equation if softest phonation was excluded. Using such equations, alpha was computed for all values of equivalent sound level observed for each subject and compared with observed values. The maximum and the mean absolute errors were and between 0.1 and . When softest phonation was disregarded and linear equations were used, the maximum error was less than and the mean absolute errors were between 0.2 and . The strong correlation between equivalent sound level and alpha indicates that equivalent sound level for a voice can be used for predicting alpha.
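The linear approximation and the error measures mentioned in this abstract can be sketched as an ordinary least-squares fit. The numbers below are made-up placeholders generated from an exact line, not values from the study, and the function names `fit_line` and `absolute_errors` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def absolute_errors(xs, ys, slope, intercept):
    """Absolute difference between observed and predicted values."""
    return [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]


# Placeholder data (NOT from the paper): sound levels and a spectrum-balance
# measure that depends on level through an exact, arbitrary line.
levels = [60.0, 65.0, 70.0, 75.0, 80.0]
measure = [0.5 * level - 50.0 for level in levels]

slope, intercept = fit_line(levels, measure)
errors = absolute_errors(levels, measure, slope, intercept)
max_error = max(errors)
mean_error = sum(errors) / len(errors)
```

With real recordings, one such fit per subject would give the per-subject maximum and mean absolute errors of the kind the abstract reports.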
120(2006); http://dx.doi.org/10.1121/1.2204590
In everyday listening, both background noise and reverberation degrade the speech signal. Psychoacoustic evidence suggests that human speech perception under reverberant conditions relies mostly on monaural processing. While speech segregation based on periodicity has achieved considerable progress in handling additive noise, little research in monaural segregation has been devoted to reverberant scenarios. Reverberation smears the harmonic structure of speech signals, and our evaluations using a pitch-based segregation algorithm show that an increase in the room reverberation time degrades performance because periodicity in the target signal is weakened. We propose a two-stage monaural separation system that combines inverse filtering of the room impulse response corresponding to the target location with a pitch-based speech segregation method. As a result of the first stage, the harmonicity of a signal arriving from the target direction is partially restored while signals arriving from other directions are further smeared, and this leads to improved segregation. A systematic evaluation shows that the proposed system yields considerable signal-to-noise ratio gains across different conditions. Potential applications of this system include robust automatic speech recognition and hearing aid design.
An effective cluster-based model for robust speech detection and speech recognition in noisy environments 120(2006); http://dx.doi.org/10.1121/1.2208450
This paper presents an accurate speech detection algorithm for improving the performance of speech recognition systems working in noisy environments. The proposed method is based on a hard decision clustering approach in which a set of prototypes is used to characterize the noisy channel. The presence of speech is detected by a decision rule formulated in terms of an averaged distance between the observation vector and a cluster-based noise model. The algorithm benefits from using contextual information, a strategy that considers not only a single speech frame but also a neighborhood of data in order to smooth the decision function and improve speech detection robustness. The proposed scheme has a low computational cost, making it suitable for real-time applications such as automated speech recognition systems. An exhaustive analysis is conducted on the AURORA 2 and AURORA 3 databases in order to assess the performance of the algorithm and to compare it to existing standard voice activity detection (VAD) methods. The results show significant improvements in detection accuracy and speech recognition rate over standard VADs such as ITU-T G.729, ETSI GSM AMR, and ETSI AFE for distributed speech recognition, and over a representative set of recently reported VAD algorithms.
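The decision rule outlined in this abstract can be illustrated with a small sketch. It assumes that each frame is scored by its Euclidean distance to the nearest noise prototype and that the scores are averaged over a symmetric window of neighboring frames before thresholding. The abstract does not give the exact distance, window, or threshold, so those choices and the function names are illustrative assumptions.

```python
import math


def nearest_prototype_distance(frame, prototypes):
    """Distance from a feature vector to its closest noise prototype."""
    return min(math.dist(frame, p) for p in prototypes)


def detect_speech(frames, prototypes, context=1, threshold=1.0):
    """Flag a frame as speech when the distance to the noise model,
    averaged over the frame and `context` neighbors on each side,
    exceeds `threshold`.
    """
    distances = [nearest_prototype_distance(f, prototypes) for f in frames]
    decisions = []
    for t in range(len(distances)):
        # Clip the context window at the start and end of the utterance.
        lo = max(0, t - context)
        hi = min(len(distances), t + context + 1)
        window = distances[lo:hi]
        decisions.append(sum(window) / len(window) > threshold)
    return decisions


# Hypothetical 2-D features: two noise prototypes and seven frames, of which
# the middle three lie far from the noise model.
noise_prototypes = [(0.0, 0.0), (1.0, 0.0)]
frames = [(0.0, 0.0), (1.0, 0.1), (5.0, 5.0), (6.0, 5.0),
          (5.0, 6.0), (0.1, 0.0), (1.0, 0.0)]
decisions = detect_speech(frames, noise_prototypes, context=1, threshold=3.0)
```

Averaging over the neighborhood trades a small delay for a smoother decision: an isolated frame that happens to fall far from the prototypes is less likely to trigger a false detection.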