Index of content:
Volume 73, Issue S1, May 1983
- PROGRAM OF THE 105TH MEETING OF THE ACOUSTICAL SOCIETY OF AMERICA
- Session A. Psychological Acoustics I, Physiological Acoustics I, and Architectural Acoustics I: Directional Hearing I
- Invited Papers
73(1983); http://dx.doi.org/10.1121/1.2020273
The physical mechanisms of acoustic wave diffraction and scattering by the head and torso, as well as the resonances excited in the external ear, produce interaural time differences, interaural pressure level differences, and direction‐dependent pressure spectra at the eardrum. The frequency dispersion of the diffracted wave causes the phase‐derived and envelope‐derived interaural time differences to be frequency dependent and to differ from each other. However, at high frequencies these time differences converge to one common value which equals the delay of the leading edge of a pulse. Interaural pressure level differences at high frequencies result primarily from the frequency‐dependent attenuation of the waves diffracted into the shadow region of the head and from the directivity of the pinna. Spectral localization cues, which are usually associated with localization in the vertical median plane, result at low frequencies from the interference of the wave scattered by the torso and the direct wave to the external ear; transverse wave motion in the pinna appears to produce the direction‐dependent high‐frequency pressure spectra at the eardrum. Mathematical models which describe the wave motion about the head and in the external ear will be described and compared to physical measurements. Potential effects on the localization cues for aided listeners and associated difficulties of these acoustical measurements with real microphones in the external ear will also be discussed. [Portions of this work were supported by the National Science Foundation.]
73(1983); http://dx.doi.org/10.1121/1.2020274
To this day it is widely accepted that localization of sounds in auditory space is determined primarily by two cues, interaural differences in stimulus intensity and time of arrival. However, recent research forces us to appreciate the deficiencies of this point of view. It cannot, for example, explain localization in the median plane or localization with one ear, both of which are done with reasonable accuracy. Nor can it account for the fact that stimuli presented (over headphones) only with interaural differences appear to originate from inside the listener's head. Acoustic measurements (ours and others) show that in addition to interaural time and intensity differences, stimuli in free space produce large direction‐dependent spectral cues, resulting from the acoustic filtering action of outer‐ear structures (pinnae, etc.). These cues are large, and occur in relatively low‐frequency regions (below 5 kHz), even when the source is moved a small amount in the horizontal plane. When stimuli are computer synthesized such that both the spectral and interaural difference cues are preserved, headphone presentation produces a nearly perfect simulation of free‐field listening. Results from several recent experiments conducted in our laboratory and elsewhere lead us to the conclusion that the filtering action of the outer ear is at least as important to our perception of auditory space as is the fact that we listen with two ears. [Work supported by NSF.]
73(1983); http://dx.doi.org/10.1121/1.2020321
When acoustic stimuli are presented over headphones, listeners usually report that a sound image appears within their head. Interaural differences in time and intensity can move the position of this image within the head. This effect is usually referred to as lateralization. We will review some of the basic results on the lateralization of simple stimuli. The relationship between lateral position and the interaural differences of time and intensity will be summarized. These results are used to predict those values of interaural time and intensity that yield an equivalent lateral position. The results from experiments which measure the just discriminable differences in interaural time and intensity will also be reviewed. The relationship between these lateral discrimination results and those from the lateral position studies will be described. The various models of lateralization all assume some type of cross‐correlation operation. The cross‐correlation approach to modeling these data has some strengths and some weaknesses. Results concerning the interaural differences of onset time and ongoing time will be compared. An attempt will be made to provide a general framework for the basic data concerning the lateral position and lateral discrimination of simple stimuli. The extent to which results from these lateralization experiments can account for our ability to localize sound sources in the free field will be discussed.
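As a rough illustration of the cross‐correlation operation that the lateralization models mentioned in this abstract assume, the sketch below estimates an interaural delay as the lag that maximizes the cross‐correlation of the left‐ear and right‐ear signals. The function name `interaural_lag`, the sample‐based lag units, and the toy signals are assumptions made for illustration; the abstract does not specify any particular model.

```python
def interaural_lag(left, right, max_lag):
    """Return the lag (in samples) at which right best matches left.

    A positive result means the right-ear signal is delayed relative
    to the left-ear signal.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(len(left)):
            j = i + lag
            # Only sum over samples where both signals are defined.
            if 0 <= j < len(right):
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy example (hypothetical data): the right channel is the left
# channel delayed by 3 samples.
left = [0.0, 1.0, 0.5, -0.5, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
right = [0.0, 0.0, 0.0] + left[:-3]
estimated_lag = interaural_lag(left, right, max_lag=5)
```

Real models typically apply such an operation within frequency bands after peripheral filtering; this sketch omits those stages.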
73(1983); http://dx.doi.org/10.1121/1.2020322
At one time it was felt that interaural differences of time could affect the localization only of low frequencies. Psychoacoustic support for this view came from observations with sinusoidal stimuli that showed a total insensitivity to interaural delays in tones above about 1.5 kHz. In recent years, however, it has become increasingly clear that listeners can detect interaural delay in signals of high‐frequency content, as long as the spectrum is complex. For example, performance can be excellent when the signal is a high‐frequency carrier that has been modulated in amplitude. The purpose of this paper is to discuss functional differences between simple and complex sounds, with an eye toward development of a model of binaural interaction that accounts for the responses to both. Toward this end, special attention will be paid to stimulus factors that differentiate the two classes. In the preceding paper, Yost compares the responses to interaural differences of time in the stimulus onset and in the waveform. That notion will be expanded here in a model which attributes a spectral role to onsets and which visualizes both individual peaks in the envelopes of high frequencies and the oscillations of low frequencies as successions of mini‐onsets.
73(1983); http://dx.doi.org/10.1121/1.2020323
Behavioral, anatomical, and electrophysiological evidence indicates that the superior olives are probably the lowest‐level nuclei that can provide for the quick convergence of the binaural activity necessary for sound localization and also for its integration and subsequent redistribution to the higher centers of the auditory system. Because of the nature of the convergence and sorting of the fibers ascending through the hindbrain and because the representation of the auditory field at higher centers is contralateral, the hindbrain auditory system can be profitably viewed as an acoustic “chiasm.” Since the lateral lemniscus is the lowest level of the auditory system where unilateral section results in contralateral deficits in sound localization, recent advances in knowledge about the trajectory and interconnections of fibers up to that level will be reviewed.
73(1983); http://dx.doi.org/10.1121/1.2020324
This review concerns physiological studies of the neural mechanisms of sound localization. We will describe the responses of neurons in the central auditory system to sounds presented dichotically or in the free field. The free‐field studies have sought to define the spatial receptive fields of these neurons and their topographical organization in the brain. Although the functional aspects of these cells are most effectively addressed by free‐field methods, the mechanism by which these cells accomplish this task is best studied using dichotic stimulation. Influenced by the duplex theory and human psychoacoustics, the major focus of neurophysiological studies using dichotic stimulation has been the investigation of the effects of varying interaural phase and intensity. The neuronal responses to these stimuli and their relationship to the frequency domain will be discussed with a particular emphasis on the concept of characteristic delay. In an attempt to define the underlying circuitry, we will compare the neural responses to a model of a binaural cell. [This work supported by N.I.H. grants NS18027 (S. Kuwada), EY02606 (T. C. T. Yin), and NS12732 (J. E. Hind).]
- Session B. Speech Communication I: Speech Perception and Speech Synthesis
- Contributed Papers
73(1983); http://dx.doi.org/10.1121/1.2020366
This study addressed the issue of whether the perception of vocal fry depends on waveform shape, or whether fry is perceived categorically only as a function of frequency. Voicelike stimuli were generated with a transmission‐line synthesis program [Titze, Transcripts Care Prof. Voice (1983)] that can vary fundamental frequency (F0), open quotient (γ), and speed quotient (δ), while maintaining a realistic glottal shape. In Exp. I, all combinations of F0 (40, 60, 80, and 100 Hz), γ (0.1, 0.35, and 0.6), and δ (1.0, 2.5, 5.0, and 8.0) were judged by ten trained listeners, using a binary fry‐no fry forced response method. The results reinforced previous conclusions that F0 is the primary determinant in the perception of vocal fry [H. Hollien, J. Phonet. 2, 125–143 (1974)] regardless of waveshape. Mean % judged as fry was 99.4%, 75.6%, 21.1%, and 2.77% at 40, 60, 80, and 100 Hz, respectively, with chance level performance (50%) predicted at about 70 Hz. Differences among γ and δ values were statistically nonsignificant. Experiment II included only stimuli of 40 and 60 Hz, each F0 containing all combinations of γ and δ values. A round‐robin tournament was utilized, whereby each stimulus competed against every other stimulus for a total of comparisons. Although intrajudge reliability was high (r≈0.8), interjudge agreement was not consistently as high, yielding nonsignificant differences among γ and δ means, and suggesting that several waveshapes are perceptually plausible within acceptable F0's.
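The abstract does not say how the chance‐level crossover near 70 Hz was predicted. As one plausible reading, the sketch below linearly interpolates between the reported mean percentages at 60 and 80 Hz to find where the curve crosses 50%. The function name `crossover_frequency` and the choice of linear interpolation are assumptions, not the authors' stated method.

```python
def crossover_frequency(freqs_hz, percent_fry, target=50.0):
    """Linearly interpolate the F0 at which percent_fry crosses target."""
    points = list(zip(freqs_hz, percent_fry))
    for (f_lo, p_lo), (f_hi, p_hi) in zip(points, points[1:]):
        # Look for the pair of adjacent points that brackets the target.
        if (p_lo - target) * (p_hi - target) <= 0 and p_lo != p_hi:
            fraction = (p_lo - target) / (p_lo - p_hi)
            return f_lo + fraction * (f_hi - f_lo)
    raise ValueError("target is not bracketed by the data")


# Mean percentages reported in the abstract for Exp. I.
f0_values = [40, 60, 80, 100]
mean_percent_fry = [99.4, 75.6, 21.1, 2.77]
crossover = crossover_frequency(f0_values, mean_percent_fry)
```

Under this assumption the crossover falls between 69 and 70 Hz, in line with the "about 70 Hz" stated in the abstract.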
73(1983); http://dx.doi.org/10.1121/1.2020367
A recurring proposal in the auditory literature is that variations in frequency and variations in intensity may, in some instances, have equivalent perceptual consequences. Some versions of this proposal argue for the existence of a psychological dimension of “modulation” nearly orthogonal to those of pitch and loudness [e.g., E. Zwicker, J. Acoust. Soc. Am. 34, 1425 (1962)]. Such a dimension may play a role in phonetic perception [L. Chistovich, Fiziol. Chel. 3(6), 1103 (1977)]. The study to be discussed examines this proposal by asking listeners to adjust amplitude variations to match speechlike frequency variations in second formant signals. The frequency variations are produced by changing the center frequency (by 100, 200, 300, or 400 Hz) of a second formant resonance over a 60 or 120 ms (haversine shaped) interval which is temporally centered in a 600‐ms signal. Comparison signals are temporally analogous second formants in which the depth of a haversine amplitude modulation can be adjusted. Preliminary results for one listener reveal a systematic relationship between adjusted amplitude depth and the extent of frequency variation in the standard; the nature of the relationship is consistent with a “modulation” perception interpretation. [Supported in part by NIH/NINCDS NS ♯ 11647 and the Louisiana Eye and Ear Foundation.]
73(1983); http://dx.doi.org/10.1121/1.2020368
Several previous experiments have suggested that the intelligibility of synthetic speech can be improved with practice. However, an alternative interpretation of this research is that subjects simply learned to perform the experimental tasks better without any change in intelligibility. To test these alternatives, we conducted an experiment to separate the effects of training on task performance from improvements in the intelligibility of synthetic speech. Three groups of subjects were tested on day 1 (pre‐test) and day 10 (post‐test) of the experiment with synthetic speech generated by the Votrax Type‐N‐Talk text‐to‐speech system. One group received training with synthetic speech on days 2–9; a second group received exactly the same training procedures on days 2–9 with natural speech; the third group received no training at all. Intelligibility was assessed for isolated words, syntactically correct meaningful sentences, syntactically correct but semantically anomalous sentences, and prose passages. In this paper, we will discuss the effects of training measured by changes in intelligibility between the pre‐test and the post‐test. The implications of these results for applications of low‐cost text‐to‐speech systems will be discussed.
The effects of training on intelligibility of synthetic speech: II. The learning curve for synthetic speech
73(1983); http://dx.doi.org/10.1121/1.2020369
This paper presents the second part of a report on the changes in intelligibility of synthetic speech produced by training. Two groups of subjects received training between the pre‐test and post‐test assessments of the intelligibility of the Votrax Type‐N‐Talk system. One group was explicitly trained on the synthetic speech. The second group received the same training procedures with natural speech. During training, subjects were presented with isolated words, meaningful and anomalous sentences, and prose passages. On each trial for words and sentences after identifying the stimulus, subjects were presented with feedback of a visual presentation of the stimulus and a second auditory presentation. For the first prose passage, subjects were given a printed version to read during auditory presentation of the passage. Subjects listened to three other passages without a printed version and then answered comprehension questions. Accuracy of word identification and response latency and accuracy to comprehension questions were recorded. Based on these data we will discuss the time‐course of learning to perceive synthetic speech more accurately.
73(1983); http://dx.doi.org/10.1121/1.2020419
Phonemic restoration is an auditory illusion which arises when a phoneme is removed from a word and replaced with noise. Listeners hear the word as intact: they perceptually restore the missing phoneme. Recently it has been shown that the magnitude of the illusion diminishes with practice [H. C. Nusbaum, A. C. Walley, T. D. Carrell, and W. H. Resslar, Res. on Speech and Hearing Prog. Rep. No. 8, Indiana University (1983)]. One explanation for this result is that subjects learn to ignore their expectation of hearing an intact word by directing attention to the speech waveform itself, specifically attending to the location of the target phoneme. To test this explanation, two groups of subjects received training in the illusion. Group 1 (Control) saw each stimulus word printed on a CRT screen prior to hearing that word. Group 2 (Directed Attention) also saw each word on the screen, but with the target phoneme underlined in each. The performance of the second group relative to the first has implications both for an explanation of the illusion as a misdirector of attention and for theories of the role of attention and the degree to which it may be controlled in the perception of fluent speech.
73(1983); http://dx.doi.org/10.1121/1.2020420
If phonological short‐term memory preserves phonetic detail (as would seem necessary if listeners are to correct misperceptions on the basis of later information), a sequence of words in different dialects might be difficult to recall, because phonological redundancy could not be exploited. Five women (one Scottish English, one ESL with Italian interference, three GA speakers) recorded the digits one, two⋯ten. From these we compiled 20 pseudo‐random, 12‐word single‐talker sequences, 20 three‐talker same‐dialect (GA) sequences, and 20 three‐talker, mixed‐dialect (GA, SE, ESL) sequences. Serial order recall by 18 female GA speakers was tested. Recall of single‐talker sequences was not reliably better if talker and listeners shared dialect (GA), nor did recall of later words in a sequence vary systematically with sequence‐type. But earlier words were better recalled in three‐talker, same‐dialect than in three‐talker mixed‐dialect sequences, both for all words and for GA words common to both sequence‐types. Implications for the role of phonetic storage in normal listening will be discussed. [Supported by NICHD and NSF.]
73(1983); http://dx.doi.org/10.1121/1.2020421
Explanations of phonetic perception describe the conversion of the acoustic signal to phonetic segments. By making the descriptions of phonetically relevant patterns abstract, these explanations have accommodated the great variability evident in the acoustic signals produced by different talkers and in different utterances by the same talker. In order to refine this characterization of perception, we sought to determine the precise limits of phonetically useful information over frequency variation. A frequency transposition test was performed with three sentences, employing the technique of sinusoidal replication of natural utterances. This technique emphasized time‐varying acoustic information, and also controlled for timbre changes over frequency. The results show that a phonetic listening band may exist within the human range of audible frequencies. The relation of this finding to frequency coding, perception of musical pitch, and development of sensory prostheses will be discussed. [Supported by the National Institute of Child Health and Human Development.]
73(1983); http://dx.doi.org/10.1121/1.2020422
The effect of processing speech signals with a simple model of the peripheral auditory system will be described. Compared with a standard speech spectrogram or filter bank analysis, this model transforms the signals in each of the frequency, amplitude, and temporal dimensions. In particular, the temporal transformation models the adaptation of auditory neurons to sustained energy in a particular frequency band, and results in distinctive responses to segments of speech exhibiting rapid changes in frequency or amplitude (consonants). These transformations result in new relationships between the phonetic identity of the speech and the observable characteristics of the transformed signal. Examples of the response of this model to speech signals will be presented, and the response properties that correspond to the phonetic identity of the signals will be discussed. [Work supported by NSF.]
73(1983); http://dx.doi.org/10.1121/1.2020423
Petersen et al. [Science 202, 324–327 (1978)] reported that Japanese macaques exhibited a right ear advantage (REA) when listening to a semantically distinctive feature of their vocalizations. Several heterospecific monkeys tested in identical conditions with the Japanese monkeys' calls failed to show any reliable ear advantage. However, recent studies with humans have shown that the size and/or direction of ear advantages vary with the physical feature to which subjects attend. This raised the question of whether the species differences in lateralization obtained in our setting represented (1) a processing difference for the same signal feature or (2) a difference between species in the signal feature to which they attended. To choose between those alternatives we ascertained whether Japanese and heterospecific monkeys were in fact attending to the same acoustic feature when they showed their respective patterns of lateralization. Two Japanese and two heterospecific monkeys received extensive training on a discrimination task that required them to classify 15 vocalizations into two categories. During this time the Japanese monkeys exhibited REA's but the heterospecifics showed no ear advantage. The animals were then tested for generalization to 25 novel, natural vocalizations and six synthetic calls. These tests revealed that all four animals were attending to the same feature of the Japanese monkey calls. The animals were then returned to the original training stimuli and they exhibited their earlier patterns of lateralization. Thus this study shows that even when attending to the same physical feature, Japanese and heterospecific monkeys employ different neural processing strategies. This suggests that the communicative valence of the signals used in testing may account for the lateralization differences observed between species. [Work supported by NSF, Deafness Research Foundation, and Sloan Foundation.]
73(1983); http://dx.doi.org/10.1121/1.2020424
There are a number of LPC‐based speech synthesis integrated circuits available on the market today. For practical applications of these chips, it is advantageous to use an automatic code generation system to generate speech code quickly. We have developed a pitch‐synchronous, variable frame‐length and variable bit‐rate automatic code generation software system that runs on the HP‐1000 minicomputer. The system performs pitch detection and voiced/unvoiced/silence classification, and then marks analysis frames using perceptual distance measures on speech spectra, as well as pitch and energy parameters. Using the distance measures, four to five synthetic sounds at different bit rates are generated for each original sound. The system is organized in such a way that its occasional errors can be conveniently corrected by hand. A demo tape of sounds synthesized by the General Instrument SP 0256 integrated circuit using the code developed by the automatic code generation system will be played.
73(1983); http://dx.doi.org/10.1121/1.2020474
Pitch and duration changes are fairly simple to realize in pitch‐excited LPC vocoders. Recent work [Atal and Remde, Proc. ICASSP 82, pp. 614–617] has shown that use of multi‐pulse excitation provides a significant improvement in the quality of synthetic speech. However, procedures for changing pitch with multi‐pulse excitation are not known. We discuss methods for introducing pitch and duration changes in speech synthesized with multi‐pulse excitation. Pitch is changed by modifying the length of individual pitch periods. Two methods of adjusting the period length were investigated: one by linear scaling of the time axis of the multi‐pulse excitation, the other by adding or deleting zeros in the excitation. The second method produced very little distortion in the synthetic speech, except in those cases where there was a significant interaction between the first formant and the pitch frequency. The scaling of the time axis produced significantly more distortion. Duration changes are accomplished by adding or removing pitch periods. The LPC area parameters and the amplitude of the major excitation pulse are interpolated as necessary in creating additional pitch periods.
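The two period‐length adjustments named in this abstract can be sketched on a single pitch period of excitation stored as a list of samples, mostly zeros with a few nonzero pulses. This is a minimal illustration, not the authors' procedure: the function names, the rounding of scaled pulse positions, and the choice to add or delete zeros at the end of the period are all assumptions, since the abstract does not say where in the period the zeros are changed.

```python
def scale_time_axis(period, new_length):
    """Method 1: move each nonzero pulse to a linearly scaled position."""
    old_length = len(period)
    scaled = [0.0] * new_length
    for position, amplitude in enumerate(period):
        if amplitude != 0.0:
            new_position = round(position * new_length / old_length)
            # Keep the pulse inside the new period.
            new_position = min(new_position, new_length - 1)
            scaled[new_position] += amplitude
    return scaled


def add_or_delete_zeros(period, new_length):
    """Method 2: leave pulses in place; change only the trailing zeros.

    Assumption: zeros are added or deleted at the end of the period.
    """
    if new_length >= len(period):
        return list(period) + [0.0] * (new_length - len(period))
    removed = period[new_length:]
    if any(sample != 0.0 for sample in removed):
        raise ValueError("shortening would delete a nonzero pulse")
    return list(period[:new_length])


# Hypothetical 10-sample pitch period with three excitation pulses.
period = [1.0, 0.0, 0.0, -0.4, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0]
longer_scaled = scale_time_axis(period, 12)      # pulses move to 0, 4, 6
longer_padded = add_or_delete_zeros(period, 12)  # pulses stay at 0, 3, 5
shorter_padded = add_or_delete_zeros(period, 8)
```

In the second method the spacing between pulses within a period is untouched, whereas the first method shifts every pulse after the first.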
73(1983); http://dx.doi.org/10.1121/1.2020475
The finding that synthetic speech, based on the customary linear model of acoustic speech production, often sounds somewhat machinelike, has motivated various attempts to refine the model. Our approach to the problem concentrates on variations in formant parameters due to the changing termination impedance at the glottis. Open and closed glottis intervals are determined from the electroglottogram that is digitized and recorded together with the speech signal. Linear prediction analysis of the signal segments taken from the closed glottis intervals leads to stable results for both formant frequencies and bandwidths. LP analysis of segments from the open glottis intervals leads to extremely erratic results, which can be proved to be due to pronounced spectral zeros. We will present numerical data on within‐cycle formant variations in the speech of adult males obtained from cepstrally smoothed spectra. [Work supported in part by the Dutch Organization for the Advancement of Pure Research Z.W.O.]
73(1983); http://dx.doi.org/10.1121/1.2020476
The vocal tract analog synthesizer generally requires a lot of calculation when a real signal is to be obtained. Thus, for research studies, a real‐time synthesizer is more convenient. One way to carry out sound wave propagation modeling in the vocal tract is to consider the Kelly‐Lochbaum model. In this case, we have to introduce a damping coefficient at the junction of the two elementary tubes. In order to decrease the computation size, we propose to concentrate the losses at three places along the vocal tract. These places are chosen so that the end characteristics (glottis, radiation) will not be disturbed. If there are about 20 elementary cells, the losses are placed between cells number 2 and 3, 10 and 11, 14 and 15 (number 1 is next to the glottis). The simulation of this model, including the wall vibration and lip radiation according to the principle proposed by D. Degryse [The 4th F.A.S.E. symposium, Venezia, 193–196 (1981)], gives formant characteristics close to those of the distributed model: the maximum error is less than 1% on the first three formant frequencies and 20% on the respective bandwidths. With this change, the synthesizer requires about 25% fewer computations than without the improvement.
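The loss placement described in this abstract can be sketched as follows, assuming a 20‐cell tube model. The code computes the reflection coefficient at each junction from an area function (using one common sign convention) and builds a per‐junction damping list that is 1.0 everywhere except at the three junctions named in the abstract. The area values, the function names, the per‐junction damping of 0.999, and the equal split of the total attenuation across the three lossy junctions are all assumptions for illustration; the abstract gives none of these details.

```python
# Cell pairs between which the abstract places the losses
# (cell number 1 is next to the glottis).
LOSSY_JUNCTIONS = {(2, 3), (10, 11), (14, 15)}


def reflection_coefficients(areas):
    """Reflection coefficient at each junction between adjacent tubes."""
    return [
        (a_left - a_right) / (a_left + a_right)
        for a_left, a_right in zip(areas, areas[1:])
    ]


def lumped_damping(num_cells, per_junction_damping):
    """Damping factor per junction when losses sit at three places only.

    Junction k (1-based) lies between cells k and k + 1. The three lossy
    junctions share the total attenuation of the distributed model
    equally; this equal split is an assumption of the sketch.
    """
    num_junctions = num_cells - 1
    total = per_junction_damping ** num_junctions
    lumped = total ** (1.0 / len(LOSSY_JUNCTIONS))
    return [
        lumped if (k, k + 1) in LOSSY_JUNCTIONS else 1.0
        for k in range(1, num_junctions + 1)
    ]


areas = [1.0] * 10 + [2.0] * 10  # hypothetical 20-cell area function
reflections = reflection_coefficients(areas)
damping = lumped_damping(len(areas), per_junction_damping=0.999)
```

With losses at only three junctions, the remaining sixteen junctions need no damping multiplication per sample, which is the kind of saving the abstract reports.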
73(1983); http://dx.doi.org/10.1121/1.2020477
This report is the first attempt at Japanese speech synthesis generated from a syntactic base. While the authors have acoustic‐phonetically analyzed an amount of speech data, they have also brought forth theoretical refinements to the rule system of Japanese phonology and syntax [J. D. McCawley, The Phonological Component of a Grammar of Japanese (Mouton, The Hague, 1968); H. Yoshiba, Ling. Analysis 7, 241–262 (1981)]. Both efforts seek a generative model for the speech data, and, to verify the theory, a speech synthesis‐by‐rule system comprising phonological rules as well as other higher‐level generations is now in progress. It is shown that a purely phonological interpretation of some of the forms yields ad hoc rules and complicates the system, and hence that restructuring the syntactic/phonological organization can remarkably simplify the phonological rules. It is expected that this significant modification in the theory can bring constructive generalization and rearrangement to the synthesis‐by‐rule system of Japanese.