Volume 76, Issue S1, October 1984
Index of content:
- PROGRAM OF THE 108TH MEETING OF THE ACOUSTICAL SOCIETY OF AMERICA
- Session A. Speech Communication I: Speech Analysis and Synthesis
- Contributed Papers
76(1984); http://dx.doi.org/10.1121/1.2021742
Commercially available parallel processors with special cellular computer architecture are able to perform certain nonlinear transformations, viz., covering transformations, much more efficiently than von Neumann type machines. Additionally, these computers' interactive image processing capabilities often lead to the discovery of insightful ways of approaching old problems. It is natural to inquire whether the inherent advantages of cellular processors would be of benefit in speech analysis. This paper discusses early results obtained by treating the speech spectrogram as an “image” and analyzing this “image” on a Cytocomputer™. Results include the characterization of the distinctive features of speech in mathematical morphological terms. Examples of morphological transformations of speech spectrograms will be presented which suggest applications in speech analysis, automatic speech recognition, speech synthesis, bandwidth compression, and speech enhancement.
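The abstract's core idea, treating a spectrogram as a grey-level image and applying morphological operations to it, can be illustrated without the Cytocomputer hardware. The following is a minimal sketch with a synthetic "spectrogram" (the matrix, noise level, and structuring-element size are illustrative assumptions, not values from the paper):

```python
import numpy as np
from scipy.ndimage import grey_opening

# Treat a spectrogram as a grey-level image (rows = frequency, cols = time).
# Toy "spectrogram": a smooth formant-like ridge plus sparse speckle noise.
rng = np.random.default_rng(0)
spec = np.exp(-((np.arange(64)[:, None] - 32) ** 2) / 50.0) * np.ones((64, 100))
spec += (rng.random(spec.shape) > 0.98) * 1.0   # speckle

# Grey-scale opening (erosion then dilation) removes bright features
# smaller than the structuring element -- here, speckle narrower than
# the ridge -- while leaving larger structures nearly intact.
opened = grey_opening(spec, size=(3, 3))
```

Opening is anti-extensive (it can only lower grey values), which is one reason it is attractive for suppressing small bright artifacts without smearing formant tracks.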
76(1984); http://dx.doi.org/10.1121/1.2021743
Standard linear predictive (LP) analysis does not represent several important properties of the human auditory system, such as decreasing spectral resolution with frequency and nonlinear intensity‐loudness characteristics. The proposed technique, perceptually based LP analysis, introduces nonlinear frequency and spectral amplitude weighting by approximating critical‐band‐weighted speech spectra using spectral transform LP modeling [H. Hermansky, H. Fujisaki, and Y. Sato, Proc. ICASSP 1983, pp. 777–780]. The critical‐band spectral weighting simulates the nonlinearity of human auditory frequency resolution, the amplitude axis root spectral transform simulates Stevens' intensity‐loudness power law, and the LP modeling emphasizes the perceptual importance of spectral envelope peaks. Experiments show that the proposed technique yields more consistent estimates of perceptually important spectral peaks and provides an efficient parametric representation of speech.
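The three stages described above (critical-band-style spectral weighting, power-law amplitude compression, and all-pole modeling) can be sketched as follows. This is not the authors' implementation; the band spacing, compression exponent, and model order are illustrative assumptions:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def perceptual_lp(frame, order=8, n_bands=16):
    """Sketch of perceptually weighted LP analysis."""
    # Short-time power spectrum of a windowed frame.
    P = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2

    # Crude critical-band-style integration: bands spaced logarithmically
    # in frequency (illustrative, not an exact Bark-scale warping).
    edges = np.unique(np.geomspace(1, len(P) - 1, n_bands + 2).astype(int))
    band = np.array([P[a:b].sum() for a, b in zip(edges[:-2], edges[2:])])

    band = band ** (1.0 / 3.0)   # Stevens-style intensity-loudness power law

    # Treat the compressed band spectrum (resampled to a fine grid) as a
    # power spectrum; its inverse DFT gives autocorrelations for LP fitting.
    fine = np.interp(np.arange(len(P)), edges[1:-1], band)
    r = np.fft.irfft(fine)[: order + 1]
    a = solve_toeplitz(r[:order], r[1 : order + 1])   # Yule-Walker solve
    return a                                          # LP coefficients

coeffs = perceptual_lp(np.random.default_rng(1).standard_normal(256))
```

The low model order is the point: the all-pole fit to the compressed, band-integrated spectrum retains the perceptually dominant envelope peaks while discarding fine harmonic structure.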
76(1984); http://dx.doi.org/10.1121/1.2021744
There is an unpleasant artifact in narrow‐band spectrograms that is caused by filtering. If the filter straddles the trough between two pitch bars of equal amplitude, even if the bars lie out in the skirts of the filter, they beat together causing spots in the spectrogram (at the fundamental frequency). Where pitch bars have small slopes this sometimes gives rise to a look of “beads‐on‐a‐chain” in the trough. At high slopes it can break up the pitch bars, producing apparent lines that may be orthogonal to the pitch bars. The artifact can be eliminated by smoothing in the direction of the pitch bars. This direction can be found during calculation of digital (e.g., FFT) spectrograms. Nearby spectra (sections) of voiced speech are essentially identical, except that one is a “stretch” of the other. The stretch that makes them the most similar specifies the direction of the pitch bars at every frequency. Local smoothing of the picture in this direction eliminates the artifact without loss or smearing of speech information. Spectrograms of speech with and without smoothing will be shown. Directional smoothing may also be useful in enhancing noisy speech.
76(1984); http://dx.doi.org/10.1121/1.2021745
It is well known that the frequencies of the low‐order driving point impedance poles and zeros of the vocal tract can be used to determine a smoothed representation of the vocal tract area function. The impedance poles are congruent with the vocal tract formants; the pole frequencies are specified by the location of peaks in the speech spectrogram. The impedance zero frequencies are indirectly related to formant bandwidths. The bandwidths, however, are difficult to measure accurately from speech. A method is presented for estimating vocal tract impedance zero frequencies directly as the peaks of the transfer function relating the acoustic speech signal to the acceleration signal from a transducer taped to the skin of the neck at the thyroid notch. This transfer function is estimated from pressure and acceleration signals using an autoregressive technique. Area function reconstructions derived from estimated pole and zero frequencies from human speech are presented. [Work supported by NIH.]
76(1984); http://dx.doi.org/10.1121/1.2021746
Much past research has been devoted to deciding, for any given speech segment, whether it is “voiced” or “voiceless.” In reality, however, speech often combines both glottal and supraglottal excitation, so that such a binary classification can be essentially invalid. We have begun a study of speech produced with mixed excitation with the two major goals of (1) determining the best analytic tools for this type of analysis and (2) examining a variety of real and modeled speech signals for their excitation characteristics. Initial research strategies and preliminary results will be discussed. [Research supported by NSA.]
76(1984); http://dx.doi.org/10.1121/1.2021794
This paper presents a new method of extracting voiced source information for constructing a high‐quality vocoder. As voiced source information, the peak locations and amplitudes of pitch pulses can be extracted directly by estimating the envelope characteristic of the amplitudes of the time‐domain residual signals. The following algorithm, adapted from LPC analysis, is used. First, the squared time‐domain residual signals are regarded as “pseudo‐frequency‐domain power spectra.” Next, these signals are symmetrized and the inverse FFT is performed. Through these computations, the autocorrelation coefficients of the “pseudo‐frequency‐domain power spectra” are obtained. Then, LPC analysis is performed on these coefficients, yielding linear prediction coefficients aᵢ. Next, the envelope signals of the magnitude spectra are computed by performing an FFT on the aᵢ. Finally, the envelope characteristic of the time‐domain residual signals regarded as power spectra is obtained. From these envelope signals, exact peak locations and amplitudes of pitch pulses during voiced segments can be extracted very easily. With this method, even small changes of the pitch pulses during transition segments between voiced and unvoiced speech can be extracted exactly. A high‐quality voice can then be synthesized with a PARCOR vocoder.
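The pseudo-spectrum trick above can be sketched directly: treat the squared residual (a time signal) as if it were a power spectrum, fit an LP model to it, and read the model's "spectrum" as an amplitude envelope. A minimal version, with an illustrative synthetic residual and model order:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def residual_envelope(residual, order=10):
    """Envelope of |residual|^2 via LP analysis of a 'pseudo-spectrum'."""
    p = residual.astype(float) ** 2                  # pseudo power spectrum
    # Symmetrize and inverse-FFT, exactly as for a real power spectrum:
    # this yields the 'autocorrelation' of the pseudo-spectrum.
    sym = np.concatenate([p, p[-2:0:-1]])
    r = np.fft.ifft(sym).real[: order + 1]
    a = solve_toeplitz(r[:order], r[1 : order + 1])  # LP on the pseudo-acf
    # The all-pole model's magnitude response, sampled where the residual
    # samples live, is a smooth envelope of the squared residual.
    w = np.fft.rfft(np.concatenate([[1.0], -a]), n=len(sym))
    env = 1.0 / np.abs(w) ** 2
    return env * (p.sum() / env.sum())               # rough gain match

# Noise amplitude-modulated at a "pitch" rate stands in for an LPC residual.
rng = np.random.default_rng(2)
res = rng.standard_normal(400) * (1.0 + 0.9 * np.sin(np.linspace(0, 20 * np.pi, 400)))
env = residual_envelope(res)
```

Peaks of `env` then mark the pitch-pulse locations, and their heights give the pulse amplitudes.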
76(1984); http://dx.doi.org/10.1121/1.2021795
A pitch detector with very high time resolution is proposed. Pitch periods can be obtained exactly, period by period. In the frequency domain, the magnitude of the second harmonic of voiced speech is at most about 10 dB larger than that of the fundamental. Therefore the fundamental component is made the strongest one by a 20‐dB/oct FIR low‐pass filter. Pitch periods are then obtained from the time intervals between peaks of the filtered voice. The fundamental peaks of the time‐domain waveform are separated from other, false peaks by a dynamic threshold defined as the moving average of the filtered and half‐rectified voice. This method is very useful for analysis and/or processing that needs exact pitch periods. In addition, since the method uses no analysis time window, it performs well for both high‐ and low‐pitched speakers.
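The detector's pipeline (low-pass filter, half-rectify, moving-average dynamic threshold, peak intervals) can be sketched on a synthetic signal. The sampling rate, filter design, and window length below are illustrative stand-ins, not values from the paper:

```python
import numpy as np
from scipy.signal import firwin, filtfilt

# Synthetic "voice": 100-Hz fundamental plus a stronger second harmonic.
fs = 8000
t = np.arange(fs) / fs
x = 0.5 * np.sin(2 * np.pi * 100 * t) + 1.0 * np.sin(2 * np.pi * 200 * t)

# Low-pass FIR so the fundamental becomes the strongest component
# (the paper specifies a 20-dB/oct rolloff; firwin is a stand-in here,
# and filtfilt keeps the peaks at their original sample positions).
lp = filtfilt(firwin(201, 130, fs=fs), [1.0], x)

# Dynamic threshold: moving average of the half-rectified filtered signal.
rect = np.maximum(lp, 0.0)
win = int(0.02 * fs)                       # 20-ms averaging window
thr = np.convolve(rect, np.ones(win) / win, mode="same")

# Fundamental peaks = local maxima above the dynamic threshold;
# the pitch period is the interval between successive peaks.
i = np.arange(1, len(lp) - 1)
peaks = i[(lp[i] > lp[i - 1]) & (lp[i] >= lp[i + 1]) & (lp[i] > thr[i])]
periods = np.diff(peaks)                   # in samples; fs/100 = 80 expected
```

Because every period is measured individually from the waveform, no analysis window sets a lower bound on the time resolution, which is the property the abstract emphasizes.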
76(1984); http://dx.doi.org/10.1121/1.2021796
Linear predictive coding (LPC) analysis has been advanced as a technique for constructing an inverse filter which recovers the glottal airflow wave from the acoustic speech wave [D. Y. Wong, J. D. Markel, and A. H. Gray, Jr., IEEE Trans. Acoust. Speech Signal Process. ASSP‐27, 350–355 (1979); M. G. Berouti, D. G. Childers, and A. Paige, IEEE ICASSP, 33–36 (1977)]. Because of the problem of interaction of LPC analysis with the glottal source, the LPC analysis interval is restricted to the closed glottis interval. This requirement is a problem in analyzing breathy and falsetto voice in males and all voice types in females. A model reference technique has been developed which constructs the inverse filter from information contained in an entire pitch period. The technique computes the inverse filter by performing a joint optimization of a model of the glottal pulse together with the inverse filter coefficients. A striking comparison is obtained between the model reference technique and LPC applied to female voice and falsetto male voice. [Work supported by NIH.]
76(1984); http://dx.doi.org/10.1121/1.2021797
Speech system design has been limited in the past by power consumption considerations, data format versus system interface trade‐offs, and general system complexity in multi‐chip systems. A CMOS speech synthesizer/ROM/processor, recently designed, solves these problems. The device consists of an 8‐bit speech processor, 8K bytes of ROM, an LPC‐10 lattice filter, and a pulse‐width‐modulating D/A. The general purpose data I/O port is designed consistent with TTL interface requirements and may be directly interfaced to four‐ and eight‐bit microprocessors and also used directly with a key matrix. Since the chip has an onboard ROM, it provides a single‐chip system for many speech applications. The programmability of the speech processor in the device makes it sufficiently flexible to process various speech data formats to achieve specific synthesis requirements.
76(1984); http://dx.doi.org/10.1121/1.2021798
A text‐to‐speech system has three linguistically related components that each affect the intelligibility of the speech it produces. These three components are: (1) the quality of the individual sounds and how they fit together (or how the algorithm fits them together), (2) the accuracy of the output of the ASCII‐to‐phonemic‐unit algorithms and/or rules, and (3) the quality of the prosodic algorithms, including word‐stress and pause‐group implementations. Text‐to‐speech development, therefore, should involve testing each of these components in order to maximize the accuracy and acceptability of each component's output. This presentation will therefore discuss an implementation of testing algorithms for phonemic construction intelligibility, for ASCII‐to‐phone accuracy, and for prosodic naturalness in the development of a text‐to‐speech system for the T.I. Professional Computer.
76(1984); http://dx.doi.org/10.1121/1.2021799
Based on an analysis of a corpus of recorded CVCV syllables and several paragraphs [C. Aoki, D. Klatt, and H. Kawasaki, “Acoustic‐Phonetic Analysis of Japanese,” J. Acoust. Soc. Am. Suppl. 1 75, S60 (1984)], as well as a study of the prior literature on analysis and synthesis of Japanese, we have formulated a set of synthesis rules within the general framework used in DECtalk. Input to the system must be specified phonemically. The program is divided into several subprograms that (1) parse this string into phonemes and associated structural/accent features, (2) apply phonological rules to select appropriate allophones or delete segments, (3) assign a duration to each segment, (4) specify onset times and strength of pitch rises and falls, (5) compute 17 time functions to control a formant synthesizer on the basis of stored tables of phonetic target values and smoothing time constants, and, finally, (6) compute a waveform from the control parameter specification. The program has been optimized by systematic spectral/waveform comparisons between the synthetic output and recordings of a selected model speaker. The oral presentation will emphasize differences between the English and Japanese synthesis systems. A demonstration will be played. [Work supported in part by an NIH grant.]
Synthesis of fundamental frequency contours of a Japanese sentence using the junction rule of the phrase component
76(1984); http://dx.doi.org/10.1121/1.2021843
A simple rule is proposed to generate fundamental frequency (F0) contours for a standard Japanese sentence. The rule has two steps. In step 1, the F0 contours of Japanese words are generated. In this model, the F0 contour of a Japanese word is decomposed into a phrase component and accent components. Phrase components decrease gradually over the utterance and are approximated simply on a logarithmic scale. Accent components are local humps, obtained by simple smoothing of a square wave on the logarithmic scale. The Japanese sentence is then regarded as a junction of words through this model, so in step 2 the F0 contour of the sentence is generated by a new junction rule. Junctions are classified into four types according to accent types. It is important to decide the F0 at the beginning of each word. This frequency is raised relative to that at the end of the previous word and is determined by the difference between the beginning frequency of the sentence and the frequency at each word's end. With this method, F0 contours can be generated very simply. Perceptual tests with synthesized speech yield very natural intonation.
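The two-component model for a single word can be sketched in a few lines: a declining phrase component on a log-F0 scale plus an accent component made by smoothing a square wave. All the constants (frame count, declination slope, accent span and height, smoothing window) are illustrative assumptions:

```python
import numpy as np

n = 200                                   # frames in the utterance
t = np.arange(n)

# Phrase component: gradual declination on the log-F0 scale.
phrase = np.log(160.0) - 0.0015 * t

# Accent component: square wave over the accented span, then smoothed.
accent = np.zeros(n)
accent[40:90] = 0.25                       # accented span (log-F0 hump height)
win = np.ones(15) / 15.0
accent = np.convolve(accent, win, mode="same")

log_f0 = phrase + accent
f0 = np.exp(log_f0)                        # back to Hz
```

Joining a second word under the junction rule would then restart its phrase component at a frequency raised relative to the end of this contour, with the amount of raising determined as the abstract describes.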
76(1984); http://dx.doi.org/10.1121/1.2021844
This paper will describe a computer program for synthesizing English fundamental frequency contours (F0 contours) which attempts comprehensive coverage of the different English melodies. The system's phonological representation of F0 contours is based on Pierrehumbert's work. [“The Phonology and Phonetics of English Intonation,” Ph.D. dissertation, MIT (1980)]. Experiments on production motivate our treatment of pitch range and its interaction with choice of melody [Liberman and Pierrehumbert, “Intonational Invariance Under Changes in Pitch Range and Length,” Language Sound Structure (MIT Press, Cambridge, MA, 1984)]. A set of rules has been developed for aligning crucial features of the F0 contour with the speech segments. The program was tested both with LPC‐encoded natural speech and with synthetic speech. A demonstration tape will be played at the meeting.
76(1984); http://dx.doi.org/10.1121/1.2021845
Voice quality can serve to distinguish individuals, social classes, and linguistic groups. It can mark the difference between male and female voices and between “normal” and pathological or synthetic voices. The goal of this study is a better understanding of these differences in order to synthesize more natural and more varied voices, in particular, female voices. Inverse filtering was used to retrieve the glottal waveform from isolated monosyllables produced by four women and four men. The resulting waveforms were examined at onset, at their midpoints, and at offset. Analysis included temporal and spectral measures. Significant differences were found between the male and female groups. The results of the analysis were used to develop a set of parameters to characterize the glottal cycle and its changes within an utterance. A tape will be played demonstrating the effects of varying these parameters.
76(1984); http://dx.doi.org/10.1121/1.2021846
In general, hearing‐impaired children can recognize the voices of their teachers better than those of normal persons. The purpose of this study is to clarify the difference between a teacher's voice and a normal person's. We analyze the Japanese vowels of a teacher and of normal persons, and compare their acoustic characteristics. The formant frequencies (F1, F2, F3) and the powers at the formant frequencies (P1, P2, P3) are chosen as the phonetic features, and are measured with our new automatic formant extractor. The experiments show no difference between the teacher's voice and the normal persons' in either F1 or F3. But the F2 of the teacher's /i/ and /e/ is higher than the normal persons', and the F2 of the teacher's /u/ is lower; that is, the F2 distances between /i/ and /u/ and between /e/ and /o/ are larger. Both P2 and P3 of the teacher's vowels are higher than those of the normal persons'. This study suggests that, for these reasons, misrecognitions between /i/ and /u/ and between /e/ and /o/ are decreased among hearing‐impaired children.
- Session B. Physiological Acoustics I and Psychological Acoustics I: Physiology and Psychophysics of Intensity Discrimination and Perception
- Invited Papers
76(1984); http://dx.doi.org/10.1121/1.2021890
Intensity discrimination is a topic of long‐standing interest in audition because it reflects processes which are fundamental to auditory perception. This paper reviews the basic psychophysical phenomena of intensity discrimination and focuses on those aspects which appear most relevant to possible mechanisms by which intensity is coded at the periphery. Of special concern is the contrast between the enormous dynamic range shown in intensity discrimination and the typically limited dynamic range of single fibers in the auditory nerve “bottleneck.” [Supported by NINCDS grant NS12125.]
76(1984); http://dx.doi.org/10.1121/1.2021891
Changes in sound intensity generally produce dynamic or transient changes in the responses of auditory neurons in addition to sustained or steady‐state changes. These dynamic components appear to enhance the response to changes in intensity and potentially play an important role in intensity discrimination. The distinction between dynamic and steady‐state responsiveness appears to be minimal in responses of hair cells, significant in responses of auditory nerve fibers, and still greater in some units of the cochlear nucleus. Results will be reviewed comparing responses in the three regions to step, ramp, and sinusoidally modulated changes in intensity. Underlying mechanisms will be discussed that may be responsible for producing the observed increases in both the magnitude and operating range of the dynamic response components.
76(1984); http://dx.doi.org/10.1121/1.2021892
The mechanisms by which intensity is coded in the discharges of single auditory nerve fibers have not been fully determined, although hypotheses involving changes in discharge rate and fiber recruitment have been proposed. In this paper, we will review the characteristics of auditory nerve fiber responses that may code for loudness growth and intensity discrimination. Psychophysical dynamic ranges, which are in excess of 100 dB, cannot be accounted for by the 30–40 dB dynamic range displayed by the majority of individual auditory nerve fibers. Recently, however, a subpopulation of auditory nerve fibers has been identified within each characteristic frequency (CF) region. The characteristics of these fibers include low spontaneous activities, a wide range (greater than 70 dB) of thresholds, high sensitivities to changes in intensity (i.e., steep rate‐intensity functions), and dynamic ranges as high as 60–70 dB. These results indicate that the responses of fibers having similar CFs and spanning the normal range of thresholds may account for a large portion of the behavioral dynamic range. Furthermore, intensity discrimination experiments designed to limit the spread of excitation (Hellman, 1974; Viemeister, 1974) suggest that intensity coding does not depend on recruitment of neighboring fibers. Finally, phase‐locking exists for stimulus frequencies below 5 kHz, and phase‐locking thresholds are on the average 15 dB lower than corresponding discharge rate thresholds. Thus, the overall dynamic range associated with auditory nerve responses can account for nearly the entire psychophysical loudness range. Other characteristics of neuronal response, such as input/output nonmonotonicities and synchrony suppression in the coding of multi‐component stimuli, will be discussed.
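The population argument above, that fibers with limited individual dynamic ranges but widely spread thresholds can jointly encode a much wider intensity range, can be illustrated numerically. The sigmoidal rate-intensity function and all parameter values below are illustrative, not data from the paper:

```python
import numpy as np

def rate(level_db, threshold_db, dyn_range_db=35.0, max_rate=200.0):
    """Saturating rate-intensity function of one fiber (spikes/s):
    near-spontaneous below threshold, saturated ~dyn_range_db above it."""
    x = (level_db - threshold_db) / dyn_range_db
    return max_rate / (1.0 + np.exp(-8.0 * (x - 0.5)))

levels = np.arange(0, 101)                  # stimulus levels, dB
thresholds = np.linspace(0, 70, 15)         # threshold spread within a CF region
pop = np.array([rate(levels, th) for th in thresholds])

# Summed population rate: although each fiber saturates over ~35 dB,
# the staggered thresholds keep the total growing across the full range,
# so a level change remains encodable at every level.
total = pop.sum(axis=0)
```

The nonzero slope of `total` at every level is the quantity that matters for discrimination: wherever it stays positive, a level increment changes the population response.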
76(1984); http://dx.doi.org/10.1121/1.2021931
Models of intensity discrimination are reviewed. The rationale and mathematical structure of types of models are discussed in relation to both psychophysical and physiological data. Particular attention is paid to models that incorporate ideas or data from peripheral physiological studies. Models are distinguished on the basis of the structure of the information on single channels and on the basis of the comparison rules for combining channels. In particular, the question of time structure versus rate structure of information on individual channels and optimum versus nonoptimum channel combination rules will be addressed. In addition to a review of available models for discrimination performance, attention is given to estimating the performance that could be achieved by optimum use of specific aspects of the auditory‐nerve patterns (e.g., average‐rate, time‐synchrony, and spread‐of‐excitation information). In general, our analysis demonstrates the sensitivity of model predictions to details of the auditory‐nerve data that are difficult to estimate from available data. Based on available data, average‐rate information appears adequate to achieve observed performance, although it seems more likely that the various types of information are combined differently in different experimental circumstances. [Work supported by U.S. Public Health Service (NINCDS).]
76(1984); http://dx.doi.org/10.1121/1.2021932
The discrimination of intensity plays a central role in Fechnerian theories of the perception of intensity: Perceptual differences between two stimuli are determined by the cumulative number of jnd's which separate the intensities. We review recent research which suggests that this view must be modified to account for the total number of jnd's in the listener's dynamic range or in the range of intensities used in an experiment. Specifically, loudness balance measurements are better predicted if the number of jnd's is normalized by the total number of jnd's in the dynamic range. Resolution between two intensities in identification experiments reflects not only the number of jnd's separating the intensities, but also the total number of jnd's in the stimulus range and the proximity of the intensities to perceptual anchors. These results indicate the need for complementing our increasing understanding of intensity discrimination with an improved understanding of perceptual anchors and the limits of the dynamic range.