Index of content:
Volume 106, Issue 4, October 1999
- SPEECH PERCEPTION 
106(1999); http://dx.doi.org/10.1121/1.427950
A front end for automatic speech recognizers is proposed and evaluated which is based on a quantitative model of the “effective” peripheral auditory processing. The model simulates both spectral and temporal properties of sound processing in the auditory system which were found in psychoacoustical and physiological experiments. The robustness of the auditory-based representation of speech was evaluated in speaker-independent, isolated word recognition experiments in different types of additive noise. The results show a higher robustness of the auditory front end in noise, compared to common mel-scale cepstral feature extraction. In a second set of experiments, different processing stages of the auditory front end were modified to study their contribution to robust speech signal representation in detail. The adaptive compression stage which enhances temporal changes of the input signal appeared to be the most important processing stage towards robust speech representation in noise. Low-pass filtering of the fast fluctuating envelope in each frequency band further reduces the influence of noise in the auditory-based representation of speech.
106(1999); http://dx.doi.org/10.1121/1.428056
On the basis of theoretical considerations and results from acoustic and perceptual analyses, it is hypothesized that closure duration is the primary cue for gemination in Italian. Results of an acoustic analysis of a large number of single and geminate Italian utterances show two acoustic correlates of gemination: the length of the closure and the length of the vowel preceding the consonant. Other acoustic parameters were not systematically related to gemination. These results were validated perceptually. At the perceptual level, the above cues were used by the listeners in the geminate/nongeminate discrimination; however, closure duration played a major role. Moreover, it was found that the significant lengthening of the consonant was only partially compensated by the shortening of the previous vowel and by a small lengthening of the geminate utterance with respect to the nongeminate one. This result suggests that speakers follow a sort of timing (rhythm) which is fixed in duration and depends on the number of syllables in the word: words with equal numbers of syllables do not change in utterance length, an elongated segment being partly compensated by the shortening of another. This process also seems to be applied perceptually, suggesting that the timing (rhythm) of a language is also an auditory attitude.
Contributions of temporal encodings of voicing, voicelessness, fundamental frequency, and amplitude variation to audio-visual and auditory speech perception
106(1999); http://dx.doi.org/10.1121/1.427951
Auditory and audio-visual speech perception was investigated using auditory signals of invariant spectral envelope that temporally encoded the presence of voiced and voiceless excitation, variations in amplitude envelope, and fundamental frequency (F0). In experiment 1, the contribution of the timing of voicing was compared in consonant identification to the additional effects of variations in F0 and the amplitude of voiced speech. In audio-visual conditions only, amplitude variation slightly increased accuracy globally and for manner features. F0 variation slightly increased overall accuracy and manner perception in auditory and audio-visual conditions. Experiment 2 examined consonant information derived from the presence and amplitude variation of voiceless speech in addition to that from voicing, F0, and voiced speech amplitude. Binary indication of voiceless excitation improved accuracy overall and for voicing and manner. The amplitude variation of voiceless speech produced only a small increment in place of articulation scores. A final experiment examined audio-visual sentence perception using encodings of voiceless excitation and amplitude variation added to a signal representing voicing and F0. There was a contribution of amplitude variation to sentence perception, but not of voiceless excitation. The timing of voiced and voiceless excitation appears to be the major temporal cue to consonant identity.
Recognition of spoken words by native and non-native listeners: Talker-, listener-, and item-related factors
106(1999); http://dx.doi.org/10.1121/1.427952
In order to gain insight into the interplay between the talker-, listener-, and item-related factors that influence speech perception, a large multi-talker database of digitally recorded spoken words was developed, and was then submitted to intelligibility tests with multiple listeners. Ten talkers produced two lists of words at three speaking rates. One list contained lexically “easy” words (words with few phonetically similar sounding “neighbors” with which they could be confused), and the other list contained lexically “hard” words (words with many phonetically similar sounding “neighbors”). An analysis of the intelligibility data obtained with native speakers of English (experiment 1) showed a strong effect of lexical similarity. Easy words had higher intelligibility scores than hard words. A strong effect of speaking rate was also found whereby slow and medium rate words had higher intelligibility scores than fast rate words. Finally, a relationship was also observed between the various stimulus factors whereby the perceptual difficulties imposed by one factor, such as a hard word spoken at a fast rate, could be overcome by the advantage gained through the listener’s experience and familiarity with the speech of a particular talker. In experiment 2, the investigation was extended to another listener population, namely, non-native listeners. Results showed that the ability to take advantage of surface phonetic information, such as a consistent talker across items, is a perceptual skill that transfers easily from first to second language perception. However, non-native listeners had particular difficulty with lexically hard words even when familiarity with the items was controlled, suggesting that non-native word recognition may be compromised when fine phonetic discrimination at the segmental level is required. 
Taken together, the results of this study provide insight into the signal-dependent and signal-independent factors that influence spoken language processing in native and non-native listeners.
Effects of lengthened formant transition duration on discrimination and neural representation of synthetic CV syllables by normal and learning-disabled children
106(1999); http://dx.doi.org/10.1121/1.427953
In order to investigate the precise acoustic features of stop consonants that pose perceptual difficulties for some children with learning problems, discrimination thresholds along two separate synthetic /da-ga/ continua were compared in a group of children with learning problems (LP) and a group of normal children. The continua differed only in the duration of the formant transitions. Results showed that simply lengthening the formant transition duration from 40 to 80 ms did not result in improved discrimination thresholds for the LP group relative to the normal group. Consistent with previous findings, an electrophysiologic response that is known to reflect the brain’s representation of a change from one auditory stimulus to another—the mismatch negativity (MMN)—indicated diminished responses in the LP group relative to the normal group to /da/ versus /ga/ when the transition duration was 40 ms. In the lengthened transition duration condition the MMN responses from the LP group were more similar to those from the normal group, and were enhanced relative to the short transition duration condition. These data suggest that extending the duration of the critical portion of the acoustic stimulus can result in enhanced encoding at a preattentive neural level; however, this stimulus manipulation on its own is not a sufficient acoustic enhancement to facilitate increased perceptual discrimination of this place-of-articulation contrast.