Contents:
Volume 104, Issue 1, July 1998
- SPEECH PERCEPTION 
104 (1998); http://dx.doi.org/10.1121/1.423299
To examine the generality of Strange’s Dynamic Specification Theory of vowel perception, two perceptual experiments investigated whether dynamic (time-varying) acoustic information about vowel gestures was critical for identification of coarticulated vowels in German, a language without diphthongization. The perception by native North German (NG) speakers of electronically modified /dVt/ syllables produced in carrier sentences was assessed using the “silent-center” paradigm. The relative efficacy of static target information, dynamic spectral information (defined over syllable onsets and offsets together), and intrinsic vowel length was investigated in listening conditions in which the centers (silent-center conditions) or the onsets and offsets (vowel-center conditions) of the syllables were silenced. Listeners correctly identified most vowels in silent-center syllables and in vowel-center stimuli when both conditions included information about intrinsic vowel length. When duration information was removed, errors increased significantly, but performance was relatively better for silent-center syllables than for vowel-center stimuli. Acoustical analyses of the effects of coarticulation on target formant frequencies, vocalic duration, and dynamic spectro-temporal patterns in the stimulus materials were performed to elucidate the nature of the dynamic spectral information. In comparison with vowels produced in citation-form /hVt/ syllables by the same speaker, the coarticulated /dVt/ utterances showed considerable “target undershoot” of formant frequencies and reduced duration differences between tense and lax vowel pairs. This suggests that both static spectral cues and relative duration information for NG vowels may not remain perceptually distinctive in continuous speech. Analysis of formant movement within syllable nuclei corroborated descriptions of German vowels as monophthongal. However, an analysis of first formant temporal trajectories revealed distinct patterns for tense and lax vowels that could be used by listeners to disambiguate coarticulated NG vowels.
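For readers unfamiliar with the paradigm, the sketch below shows how silent-center and vowel-center stimuli could be constructed from a digitized syllable. This is a minimal illustration only: the gating points, the proportion of the vocalic portion retained, and the function names are assumptions, not the parameters or procedures used in the study.

```python
import numpy as np

def silent_center(signal, sr, vowel_start_s, vowel_end_s, keep_fraction=0.25):
    """Silent-center condition: keep only the onset and offset transitions of
    the vocalic portion and replace its middle with silence.
    keep_fraction is the (illustrative) share of the vowel kept at each edge."""
    out = signal.copy()
    start = int(vowel_start_s * sr)
    end = int(vowel_end_s * sr)
    keep = int((end - start) * keep_fraction)
    out[start + keep : end - keep] = 0.0  # silence the vowel center
    return out

def vowel_center(signal, sr, vowel_start_s, vowel_end_s, keep_fraction=0.25):
    """Vowel-center condition: the complement of the above, keeping only the
    central portion of the vowel and silencing onsets and offsets."""
    out = np.zeros_like(signal)
    start = int(vowel_start_s * sr)
    end = int(vowel_end_s * sr)
    keep = int((end - start) * keep_fraction)
    out[start + keep : end - keep] = signal[start + keep : end - keep]
    return out
```

A duration-neutral version of either condition would additionally equate the silenced interval across vowels, removing the intrinsic length cue discussed above.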
104 (1998); http://dx.doi.org/10.1121/1.423251
Recent studies have shown that temporal waveform envelope cues can provide significant information for English speech recognition. This study investigated the use of temporal envelope cues in a tonal language, Mandarin Chinese. The speech was divided into several frequency analysis bands; the amplitude envelope was extracted from each band by half-wave rectification and low-pass filtering and was used to modulate a noise of the same bandwidth as the analysis band. These manipulations preserved temporal and amplitude cues in each frequency band but removed the spectral detail within each band. Chinese vowels, consonants, tones, and sentences were identified by 12 native Chinese-speaking listeners with 1, 2, 3, and 4 noise bands. The results showed that recognition scores for vowels, consonants, and sentences increased monotonically with the number of bands, a pattern similar to that observed in English speech recognition. In contrast, tones were consistently recognized at about the 80%-correct level, independent of the number of bands. This high level of tone recognition produced a significant difference in open-set sentence recognition between Chinese (11.0%) and English (2.9%) in the one-band condition, where no spectral information was available. The data also revealed that, with primarily temporal cues, the falling–rising tone (tone 3) and the falling tone (tone 4) were more easily recognized than the flat tone (tone 1) and the rising tone (tone 2). This differential pattern in tone recognition resulted in a similar pattern in word recognition: words carrying tone 3 or tone 4 were more likely to be recognized than words carrying tone 1 or tone 2. The quantitative role of tones in Chinese speech recognition was further explored using a power-function model, which indicated that tones play a significant role in relating phoneme recognition to sentence recognition.
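The band-envelope manipulation described above is essentially noise-band vocoding. A minimal sketch of that processing chain follows; the band edges, filter orders, and envelope cutoff are illustrative assumptions, since the abstract does not give the exact values used.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def noise_vocode(x, sr, n_bands=4, lo=100.0, hi=6000.0, env_cutoff=160.0):
    """Replace spectral detail with band-limited noise while preserving each
    band's temporal amplitude envelope (half-wave rectification + low-pass)."""
    rng = np.random.default_rng(0)
    # Log-spaced band edges between lo and hi Hz (assumed; not from the paper).
    edges = np.geomspace(lo, hi, n_bands + 1)
    env_sos = butter(2, env_cutoff, btype="low", fs=sr, output="sos")
    out = np.zeros_like(x, dtype=float)
    for b_lo, b_hi in zip(edges[:-1], edges[1:]):
        band_sos = butter(4, [b_lo, b_hi], btype="band", fs=sr, output="sos")
        band = sosfilt(band_sos, x)
        # Extract the amplitude envelope: half-wave rectify, then low-pass filter.
        env = sosfilt(env_sos, np.maximum(band, 0.0))
        # Modulate noise restricted to the same analysis band.
        carrier = sosfilt(band_sos, rng.standard_normal(len(x)))
        out += env * carrier
    return out
```

With one band this removes essentially all spectral cues, which is the condition in which the residual advantage for Chinese sentence recognition was observed.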
104 (1998); http://dx.doi.org/10.1121/1.423252
The goals of this study were (i) to assess the replicability of the “perceptual magnet effect” [J. Acoust. Soc. Am. 97(1), 553–561 (1995)] and (ii) to investigate the neurophysiologic processes underlying the perceptual magnet effect using the mismatch negativity (MMN) auditory evoked potential. A stimulus continuum from /i/ to /e/ was synthesized by varying F1 and F2 in equal mel steps. Ten adult subjects identified and rated the goodness of the stimuli. Results revealed that the prototype was the stimulus with the lowest F1 and highest F2 values and that the nonprototype stimulus was close to the category boundary. Subjects discriminated stimulus pairs differing in equal mel steps. The results indicated that discrimination accuracy was not significantly different in the prototype and nonprototype conditions; that is, no perceptual magnet effect was observed. The MMN evoked potential (a preattentive, neurophysiologic index of auditory discrimination) revealed that, despite equal mel differences between the stimulus pairs, the MMN was largest for the prototype pair (i.e., the pair that had the lowest F1 and highest F2 values). The MMN therefore appears to be sensitive to within-category acoustic differences. Taken together, the behavioral and electrophysiologic results indicate that discrimination of stimulus pairs near a prototype is based on the auditory structure of the stimulus pairs.
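Spacing a formant continuum in equal mel steps means equal steps on a perceptual frequency scale rather than in Hz. The sketch below uses the common 2595·log10(1 + f/700) conversion as an assumption (the abstract does not specify which mel formula was used), and the endpoint formant values and number of steps are illustrative only.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert Hz to mel (O'Shaughnessy formula; assumed convention)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def equal_mel_steps(f_start_hz, f_end_hz, n_steps):
    """Return n_steps formant values spaced in equal mel steps between endpoints."""
    mels = np.linspace(hz_to_mel(f_start_hz), hz_to_mel(f_end_hz), n_steps)
    return mel_to_hz(mels)

# Illustrative F1 continuum from an /i/-like to an /e/-like value (not the study's values).
f1_continuum = equal_mel_steps(300.0, 450.0, 13)
```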
104 (1998); http://dx.doi.org/10.1121/1.423253
Two experiments examined the effects of temporal overlap of speech gestures on the perception of stop consonant clusters. Sequences of stop consonant gestures that exhibit temporal overlap extreme enough to potentially eliminate the acoustic evidence of (at least) one of the consonants were obtained from x-ray microbeam data. In experiment 1, subjects were given a consonant monitoring task using stimuli containing stop sequences as well as those containing single stops. Results showed that (1) the initial consonant in the stop sequences was detected significantly less often than in the single stops; (2) bilabial gestures were considerably more effective at obscuring a preceding alveolar than the reverse; and (3) the detection rate correlated with an index of overlap between lip and tongue tip gestures. Experiment 2 employed stimuli that were truncated during the closure for the critical stop or stop sequence, so as to eliminate any information occurring in the acoustic signal at the stop release. This experiment showed that removing release information decreased detectability of the consonants generally. However, consistent with the observed gestural patterns, removing the release did not decrease detection of the alveolar stop when it was the first consonant of a sequence, indicating that there was no information about the alveolar stop present in the acoustic realization of the second stop release. These experiments show that certain gestural patterns actually produced by English speakers may not be completely recoverable by listeners and, further, that it is possible to relate recoverability to particular metric properties of the gestural pattern.
104 (1998); http://dx.doi.org/10.1121/1.423300
The time course of audiovisual information in speech perception was examined using a gating paradigm. VCVs that evoked the McGurk effect were gated visually and auditorily. The visual gating yielded a McGurk effect that increased in strength as a linear function of the amount of visual stimulus presented. The acoustic gating, in contrast, revealed a nonlinear function in which the VC information was considerably weaker than the CV portion of the VCV. The results suggest that the flow of cross-modal information is quite complex during audiovisual speech perception.