Volume 115, Issue 3, March 2004
Index of content:
- SPEECH PROCESSING AND COMMUNICATION SYSTEMS 
115(2004); http://dx.doi.org/10.1121/1.1646400
Studies by Shannon et al. [Science, 270, 303–304 (1995)], Van Tasell et al. [J. Acoust. Soc. Am. 82, 1152–1161 (1987)], and others show that human listeners can understand important aspects of the speech signal when spectral shape has been significantly degraded. These experiments suggest that temporal information is particularly important in human speech perception when the speech signal is heavily degraded. In this study, a system is developed that extracts linguistically relevant temporal information that can be used in the front end of an automatic speech recognition system. The parameters targeted include energy onsets and offsets (computed using an adaptive algorithm) and measures of periodic and aperiodic content; together these are used to find abrupt acoustic events that signify landmarks. Overall detection rates for strongly robust events, robust events, and weak events in a portion of the TIMIT test database are 98.9%, 94.7%, and 52.1%, respectively. Error rates increase by less than 5% when the speech signals are spectrally impoverished. Using the four temporal parameters as the front end of a hidden Markov model (HMM)-based system for the automatic recognition of the manner classes “sonorant,” “fricative,” “stop,” and “silence” yields the same recognition accuracy, 70.1%, as the standard 39 cepstral-based parameters. Combining the temporal and cepstral parameters raises the accuracy to 74.8%.
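To make the idea of energy onsets and offsets concrete, the following is a minimal sketch of one simple way such events could be located. It is not the adaptive algorithm described in the abstract: the function names `frame_log_energy` and `energy_events`, the fixed frame lag, and the 9 dB threshold are all illustrative assumptions, whereas the study's method adapts its parameters to the signal.

```python
import math


def frame_log_energy(signal, frame_len, hop):
    """Return the short-time log energy (in dB) of each frame of `signal`."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # Small floor avoids log10(0) on silent frames.
        energies.append(10.0 * math.log10(power + 1e-10))
    return energies


def energy_events(energies, lag=1, threshold_db=9.0):
    """Return (onsets, offsets): frame indices of abrupt energy rises and falls.

    `lag` and `threshold_db` are fixed here purely for illustration
    (hypothetical values); an adaptive method would tune them to the signal.
    """
    # Change in log energy relative to the frame `lag` steps earlier.
    diff = [0.0] * len(energies)
    for i in range(lag, len(energies)):
        diff[i] = energies[i] - energies[i - lag]

    onsets, offsets = [], []
    for i in range(1, len(diff) - 1):
        # Onset: local peak of a large positive energy change.
        if diff[i] >= threshold_db and diff[i] >= diff[i - 1] and diff[i] > diff[i + 1]:
            onsets.append(i)
        # Offset: local trough of a large negative energy change.
        if diff[i] <= -threshold_db and diff[i] <= diff[i - 1] and diff[i] < diff[i + 1]:
            offsets.append(i)
    return onsets, offsets


# Synthetic example: silence, a constant-amplitude burst, then silence again.
burst = [0.5 if n % 2 == 0 else -0.5 for n in range(1000)]
signal = [0.0] * 1000 + burst + [0.0] * 1000
energies = frame_log_energy(signal, frame_len=100, hop=100)
onsets, offsets = energy_events(energies)
print(onsets, offsets)  # → [10] [20]
```

In this toy signal the burst occupies frames 10 through 19, so the sketch reports one onset at frame 10 and one offset at frame 20. Real speech would need the periodicity and aperiodicity measures mentioned in the abstract, in addition to energy change, before such events could be interpreted as landmarks.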