Volume 116, Issue 2, August 2004
Index of content:
- SPEECH PROCESSING AND COMMUNICATION SYSTEMS 
116(2004); http://dx.doi.org/10.1121/1.1768958View Description Hide Description
The goal of this study was to establish the ability of normal-hearing listeners to discriminate formant frequency in vowels in everyday speech.Vowelformantdiscrimination in syllables, phrases, and sentences was measured for high-fidelity (nearly natural) speech synthesized by STRAIGHT [Kawahara et al., Speech Commun. 27, 187–207 (1999)]. Thresholds were measured for changes in F1 and F2 for the vowels /ɪ, ɛ, æ, Λ/ in /bVd/ syllables. Experimental factors manipulated included phonetic context (syllables, phrases, and sentences), sentence discrimination with the addition of an identification task, and word position. Results showed that neither longer phonetic context nor the addition of the identification task significantly affected thresholds, while thresholds for word final position showed significantly better performance than for either initial or middle position in sentences. Results suggest that an average of 0.37 barks is required for normal-hearing listeners to discriminate vowelformants in modest length sentences, elevated by 84% compared to isolated vowels.Vowelformantdiscrimination in several phonetic contexts was slightly elevated for STRAIGHT-synthesized speech compared to formant-synthesized speech stimuli reported in the study by Kewley-Port and Zheng [J. Acoust. Soc. Am. 106, 2945–2958 (1999)]. These elevated thresholds appeared related to greater spectral-temporal variability for high-fidelity speech produced by STRAIGHT than for formant-synthesized speech.
Frequent word section extraction in a presentation speech by an effective dynamic programming algorithm116(2004); http://dx.doi.org/10.1121/1.1764834View Description Hide Description
Word frequency in a document has often been utilized in text searching and summarization. Similarly, identifying frequent words or phrases in a speech data set for searching and summarization would also be meaningful. However, obtaining word frequency in a speech data set is difficult, because frequent words are often special terms in the speech and cannot be recognized by a general speech recognizer. This paper proposes another approach that is effective for automatic extraction of such frequent word sections in a speech data set. The proposed method is applicable to any domain of monologue speech, because no language models or specific terms are required in advance. The extracted sections can be regarded as speech labels of some kind or a digest of the speech presentation. The frequent word sections are determined by detecting similar sections, which are sections of audio data that represent the same word or phrase. The similar sections are detected by an efficient algorithm, called Shift Continuous Dynamic Programming (Shift CDP), which realizes fast matching between arbitrary sections in the reference speech pattern and those in the input speech, and enables frame-synchronous extraction of similar sections. In experiments, the algorithm is applied to extract the repeated sections in oral presentation speeches recorded in academic conferences in Japan. The results show that Shift CDP successfully detects similar sections and identifies the frequent word sections in individual presentation speeches, without prior domain knowledge, such as language models and terms.