Index of content:
Volume 83, Issue S1, May 1988
- PROGRAM OF THE 115TH MEETING OF THE ACOUSTICAL SOCIETY OF AMERICA
- Session A. Speech Communication I: Analysis, Coding, and Synthesis
- Contributed Papers
83(1988); http://dx.doi.org/10.1121/1.2025248
In this paper, a class of generalized time‐frequency representations (GTFR) that have both good time and good frequency characteristics for nonstationary signal analysis is presented. The basis of the approach used is that a time‐frequency representation can, in some cases, be improved by allowing negative values of the spectrum. The representation (and its associated kernel) can then be optimized to enhance the peaks in frequency while still maintaining finite time support. The finite time support property has the significance of preserving onset times of signals and also providing clear representations of fast‐changing spectral peaks. Experiments were performed on simulated data and real speech for comparison of the GTFR with the spectrogram and the pseudo‐Wigner distribution. The results show distinct advantages of the GTFR. For example, onsets of transitions are clearer and, in the case of speech, close formant peaks are easier to distinguish. [Research supported by NSF and Boeing.]
83(1988); http://dx.doi.org/10.1121/1.2025249
This paper considers the estimation of formant frequencies and bandwidths from voiced speech signals that are degraded with additive white noise. Each period of the speech signal is divided into the open and closed glottal phases using the electroglottograph signal. The speech signal in the closed phase is modeled as the sum of damped sinusoids, with each sinusoid corresponding to one formant. The analysis procedure thus leads to an estimate of the frequency, bandwidth, and energy of each formant. The results indicate that typically just three formants account for most of the energy in the closed phase. The noise robustness of the algorithm is improved by using a total least‐squares approach in obtaining the parameters and by using data from multiple consecutive closed phases. This work verifies and extends the results of S. Parthasarathy and D. Tufts [IEEE Trans. Acoust. Speech Signal Process. ASSP‐35, 1241–1249 (1987)]. [Work supported by the OSU Seed grant program.]
83(1988); http://dx.doi.org/10.1121/1.2025292
This paper considers the problem of estimating the parameters of a voiced speech model that includes the effects of source‐tract interaction. Based on work by Ananthapadmanabha and Fant [Speech Commun. 1, 167–184 (1982)], the voiced speech signal is modeled as the output of a time‐varying vocal tract filter excited by a parametrized glottal source waveform. The glottal source waveform [Fant et al., STL/QPSR (1984)] models the main pulse shape of the glottal volume velocity and is described by four parameters. The effect of source‐tract interaction is modeled by varying the frequency and bandwidth of the first formant in synchrony with the glottal source waveform. An analysis‐by‐synthesis approach is used to estimate the parameters of the model. Algorithms for estimating the parameters of the glottal source waveform and the vocal tract filter are described. Comparisons of the spectrum of the original and synthesized speech waveforms are presented. [Work supported by the OSU Seed grant program.]
83(1988); http://dx.doi.org/10.1121/1.2025293
In low bit rate speech coding systems, the LPC spectral information is often coded in terms of quantized partial correlations or line‐spectral pairs. The distribution of spectral errors with such quantization has been studied in the past. However, not much is known about the distribution of spectral errors when the frequencies and bandwidths of LPC poles are quantized. The quantization of frequencies and bandwidths has important advantages because the perceptual sensitivity to quantization of these parameters can be estimated from data on just‐noticeable differences. In this paper, results will be presented on the spectral error introduced by the quantization of frequencies and bandwidths of LPC poles as a function of the different number of predictor coefficients and the different number of bits used to represent the spectrum. The distribution of spectral errors using frequency‐bandwidth quantization will be compared with other procedures based on quantization of partial correlations and line‐spectral pairs, and relative strengths and weaknesses of these quantization schemes will be discussed.
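As a rough illustration of the kind of measurement described above, the following sketch quantizes the frequencies and bandwidths of LPC poles on a uniform grid and reports the RMS log-spectral difference in dB. Everything here is an assumption made for illustration: the abstract does not specify the quantizer, the error measure, or the pole-to-filter mapping. The sketch uses the standard relations r = exp(−πB/fs) and θ = 2πF/fs for a pole pair, and the function names are hypothetical.

```python
import cmath
import math

def poles_to_polynomial(formants, sample_rate):
    """Build A(z) coefficients [1, a1, ..., ap] from (frequency_hz, bandwidth_hz) pairs."""
    coeffs = [1.0]
    for freq, bw in formants:
        r = math.exp(-math.pi * bw / sample_rate)      # pole radius from bandwidth
        theta = 2 * math.pi * freq / sample_rate       # pole angle from frequency
        section = [1.0, -2 * r * math.cos(theta), r * r]  # conjugate pole pair
        new = [0.0] * (len(coeffs) + 2)
        for i, c in enumerate(coeffs):
            for j, s in enumerate(section):
                new[i + j] += c * s
        coeffs = new
    return coeffs

def log_spectrum_db(coeffs, n_points=256):
    """20*log10|1/A(e^{jw})| on a uniform grid from 0 up to (not including) Nyquist."""
    out = []
    for k in range(n_points):
        w = math.pi * k / n_points
        a = sum(c * cmath.exp(-1j * w * i) for i, c in enumerate(coeffs))
        out.append(-20 * math.log10(abs(a)))
    return out

def quantize(value, step):
    """Uniform quantizer (an assumed scheme, not the one used in the paper)."""
    return step * round(value / step)

def rms_spectral_error_db(formants, freq_step, bw_step, sample_rate):
    """RMS dB difference between the original and the quantized all-pole spectra."""
    quantized = [(quantize(f, freq_step), max(quantize(b, bw_step), bw_step))
                 for f, b in formants]
    ref = log_spectrum_db(poles_to_polynomial(formants, sample_rate))
    test = log_spectrum_db(poles_to_polynomial(quantized, sample_rate))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ref, test)) / len(ref))
```

Calling `rms_spectral_error_db([(520, 70), (1490, 100)], 50, 30, 8000)` gives the error for one hypothetical pair of poles at a coarse step size; sweeping the step sizes would trace out an error distribution of the sort the abstract refers to.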
83(1988); http://dx.doi.org/10.1121/1.2025294
On 1 September 1983, Korean Airlines flight 007 was intercepted and destroyed by a Soviet jet fighter after straying into Soviet airspace. Although the flight recorder from KAL 007 was not recovered, its final transmissions were recorded at a Tokyo air traffic control center. The intelligibility of the KAL 007 transmissions is compromised by at least three types of distortion: broadband background noise, narrow‐band noise tone, and frequency‐shift distortion. This paper describes a DSP system that has been developed to counteract the frequency‐shift distortion. The system employs Hilbert transform‐based techniques to introduce a linear compensating frequency shift, the extent of which may be controlled manually by the user or automatically by a separate computer. Tests performed to date with the system on the KAL 007 transmissions have shown it to be a useful tool. Intelligibility improvements have led to independently verified advances in understanding several of KAL 007's final transmissions.
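The Hilbert transform-based frequency shift mentioned above can be sketched as follows. This is a minimal illustration and not the system described in the abstract: it forms the analytic signal with a direct DFT (slow, but dependency-free), multiplies by a complex exponential at the desired shift, and keeps the real part. The function names and the choice of a DFT-based Hilbert transform are assumptions.

```python
import cmath
import math

def analytic_signal(x):
    """Analytic signal via a direct DFT (O(N^2); adequate for short demos)."""
    n = len(x)
    spectrum = [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    for k in range(n):
        # Leave DC (and the Nyquist bin for even n) untouched.
        if k == 0 or (n % 2 == 0 and k == n // 2):
            continue
        # Double positive-frequency bins, zero negative-frequency bins.
        spectrum[k] = 2 * spectrum[k] if k < (n + 1) // 2 else 0
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def frequency_shift(x, shift_hz, sample_rate):
    """Shift every spectral component of x by shift_hz (a linear frequency shift)."""
    z = analytic_signal(x)
    return [(z[t] * cmath.exp(2j * math.pi * shift_hz * t / sample_rate)).real
            for t in range(len(x))]
```

A negative `shift_hz` moves the spectrum downward, which is how a compensating shift would be applied once the offset in a recording has been estimated, whether by ear or by a separate program.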
83(1988); http://dx.doi.org/10.1121/1.2025295
“Gross” vowel spectrum parameters for vowel classification are of interest to many researchers. In the present study, the “effective formant” F′2, estimated by the large‐band spectral integration (LBI) model [Escudier et al., Acts of the French‐Swedish Seminar, Grenoble, France (1985)], is evaluated in the classification of natural French front vowels /i/ vs /y/ and /e/ vs /ø/ (rounding opposition). In parallel, formant measurements on the same corpus are provided in order to compare with and interpret the results obtained using the LBI model. Classification performance is good in the case of /i/‐/y/. The optimal spectral integration window is 2–2.5 Bark wide (cf. the 3–3.5 Bark “critical distance”). The model fails, however, in the /e/‐/ø/ classification. Here, the second formant frequency performs far better. The model's weakness apparently resides in its last stage, namely peak estimation. In fact, recent results show that the LBI spectral representation can support the definition of form factors (other than peak position) for the classification of vowels.
83(1988); http://dx.doi.org/10.1121/1.2025296
Signal processing algorithms can be used to enhance speech intelligibility by improving the detectability of important acoustic features. It should follow from this that other nonacoustic portions of the signal are a source of experimental error that should be controlled. Nevertheless, certain signal processing techniques produce reliable improvements in intelligibility only when that speech is meaningful. A series of intelligibility tests was carried out with co‐channel speech, which combines the speech of two talkers into a single channel. This speech had been processed using the harmonic magnitude suppression (HMS) technique [Naylor and Boll, ICASSP (1987)] to diminish the effect of the louder talker. Intelligibility testing using target speech of read PB sentences masking read four‐word nonsense sentences revealed no differences between treated and untreated speech. Intelligibility using complete sentences as both masker and target showed significant differences in favor of the HMS treated speech. An analysis of signals and errors will be presented to show how listeners are integrating information from nonsignal sources to improve performance.
83(1988); http://dx.doi.org/10.1121/1.2025297
Speech was amplified to a level of plus and minus 5 V for analog‐to‐digital conversion (sampling rate of 20 kHz). It was then quantized to five discrete levels. The 0‐V level was used for speech within the squelch level (squelch was used to mask background noise). Positive speech waves above A volts were assigned a value of 5 V. Speech waves with peak values below this level were assigned a value of A volts. Negative speech waves were quantized to either − 5 V or − A volts similar to the processing of the positive speech waveforms. Processing was done with a PC. Two different A levels were tested on monosyllabic word lists. The first was at the level of the rms value of each word (average rms level across all words was 1.22 V). The second level was at a fixed value of 2 V. Preliminary results yielded a higher percentage word intelligibility. This type of speech processing has application to the development of a tactile hearing aid.
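The five-level quantization described above can be sketched roughly as below. Two simplifications are assumed and are not taken from the abstract: the decision is made per sample rather than per speech wave (the abstract classifies waves by their peak values), and the squelch threshold is a hypothetical parameter whose value the abstract does not give.

```python
def quantize_five_level(samples, a_level, squelch):
    """Map each sample (in volts) to one of the levels -5, -A, 0, +A, +5.

    a_level is the intermediate level A; squelch is an assumed threshold
    below which the signal is treated as background noise.
    """
    out = []
    for v in samples:
        if abs(v) <= squelch:
            out.append(0.0)                      # inside the squelch band
        elif v > 0:
            out.append(5.0 if v > a_level else a_level)
        else:
            out.append(-5.0 if v < -a_level else -a_level)
    return out
```

With the fixed level from the abstract, `quantize_five_level(samples, a_level=2.0, squelch=0.1)` would correspond to the second test condition; the first condition would instead pass each word's rms value as `a_level`.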
A microcomputer‐based system for high‐speed processing of speech materials using pitch‐synchronous LPC analysis and synthesis
83(1988); http://dx.doi.org/10.1121/1.2025340
This paper describes a microcomputer‐based system that uses a high‐speed digital signal processor (TMS‐32030) for digitizing and encoding speech materials. The system uses pitch‐synchronous LPC (covariance method), in conjunction with a laryngograph, to analyze previously recorded speech materials [M. J. Hunt and C. E. Harvenberg, Proceedings of the 12th International Congress on Acoustics, A4‐2 (1986)]. The speech and laryngograph signals are digitized simultaneously, and speech parameters (F0, energy, formants, and bandwidths) are calculated in real time. The parameters are modified using an interactive editor, and real‐time synthesis allows for auditory monitoring of the original and modified speech samples. The system is currently being used to process speech materials for a limited‐vocabulary, word‐concatenation synthesis system [Eady et al., Proc. ICASSP, 1473–1476 (1987)], and also in the development of a demisyllable‐based text‐to‐speech device. [Work supported by National Research Council Canada and Science Council of British Columbia.]
83(1988); http://dx.doi.org/10.1121/1.2025341
This paper describes the durational aspects of a speech synthesis system designed to generate words and sentences of English based on smaller synthetic speech units known as demisyllables [O. Fujimura, J. Acoust. Soc. Am. Suppl. 1 59, S55 (1976)]. Each syllable of a word is composed of an initial and a final demisyllable. The demisyllable inventory (totaling less than 900 units) is prerecorded and stored in LPC format. Demisyllables are linked to form syllables, which are then joined into words. Syllable duration is one of the more complicated features of word formation using this technique. Our strategy for producing syllables of appropriate length is to record the demisyllable units with relatively long durations and then to reduce the duration of the demisyllable components when they are joined together. Durational adjustment is accomplished using a spectral distance metric, which identifies regions of the syllable where the spectral components are least dynamic. The less dynamic regions can then be deleted to shorten a syllable, while maintaining perceptually relevant acoustic characteristics. This method is used to adjust syllable durations at both the word and sentence levels. [Work supported by Science Council of British Columbia.]
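One way to picture the durational adjustment described above is the sketch below, which repeatedly deletes the frame across which the spectrum changes least. The abstract does not state which spectral distance metric or deletion rule is used, so the Euclidean distance between neighbouring frames and the greedy one-frame-at-a-time deletion are both assumptions, and the function names are hypothetical.

```python
import math

def spectral_distance(a, b):
    """Euclidean distance between two frames of spectral parameters (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def shorten(frames, target_length):
    """Delete the least dynamic interior frames until target_length frames remain."""
    frames = list(frames)
    while len(frames) > max(target_length, 2):
        # Score interior frame i by how little the spectrum moves across it.
        scores = [spectral_distance(frames[i - 1], frames[i + 1])
                  for i in range(1, len(frames) - 1)]
        least_dynamic = scores.index(min(scores)) + 1
        del frames[least_dynamic]
    return frames
```

Frames in rapidly changing regions, such as transitions, score high and survive, while steady-state stretches are thinned first, which matches the stated aim of keeping perceptually relevant acoustic characteristics.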
83(1988); http://dx.doi.org/10.1121/1.2025342
Although speech patterns are heavily influenced by the hierarchical linguistic structure of speech, synthesis by rule systems have generally been based on linear utterance representations. Delta is a new programing language that makes it easy to work with utterance representations containing multiple levels. Both higher level linguistic units, such as phrases, syllables, and phonemes, and lower level phonetic events, such as articulatory or formant targets and F0 trends, can be easily accommodated on separate interconnected “streams,” with each unit equally accessible to the rules. While Delta can be used to test most any synthesis model for any language, this paper will show how Delta can be used to test a particular model for English. This model uses, among others, CV, syllable, nucleus, phoneme, formant, and duration streams, with formant transitions represented as duration tokens that are in effect invisible in other streams. The paper will justify the selection of streams and the unique way of handling formant transitions, demonstrating in Delta notation how the model leads to particularly straightforward rules for predicting English phoneme durations, formant values, and aspiration patterns.
83(1988); http://dx.doi.org/10.1121/1.2025343
Two main drawbacks make telephoning using a conference system troublesome: First, noise and reverberation produced at the site of the speaker are transmitted to the listener. Second, to achieve a sufficient speech volume at the listener's site, the necessary amplification for loudspeaker playback leads to instability (feedback) of the whole system. To suppress reverberation and noise, a speech controlled microphone arrangement is used. The implemented algorithm for speech detection is able to distinguish between noise and speech up to noise levels of 70 dB SPL. The noise signal may either be noiselike, impulselike, or sinusoidal. The detection algorithm needs a maximum time of 32 ms to detect a speech signal. To avoid instability, a combination of a special filter arrangement and a speech controlled amplification of microphone and loudspeaker signal is used. Two filters, having a transfer characteristic similar to a comb filter, are inserted into the transmitting and the receiving path. The transfer functions of the filters are inverse. With regard to the tone color, the filters were optimized, taking into account the special effects of signal processing used by the human ear. Hearing tests showed that speech intelligibility is not influenced by the filter arrangement.
83(1988); http://dx.doi.org/10.1121/1.2025344
Recently, a great deal of interest has been shown in applying the Wigner‐Ville distribution to obtain time‐frequency energy representations of nonstationary signals. The Wigner distribution has proved useful in analyzing monocomponent signals. In the case of multicomponent signals, the Wigner distribution adds cross terms without any physical significance to the time‐frequency distribution. The presence of cross terms obscures the actual spectral features of interest and makes the results very misleading. To apply the Wigner‐Ville distribution to speech signals, it is essential to remove all cross terms not of interest. This problem may be solved by smoothing the Wigner‐Ville distribution independently in time and frequency directions using the “smoothed” pseudo‐Wigner estimator. This estimator possesses several advantages over the short time periodogram. This presentation will concentrate on the application of the smoothed pseudo‐Wigner distribution (SPWD) to speech signals and will demonstrate the ability of the SPWD to improve the quality of the time‐frequency representation of speech signals. In particular, a comparison will be made between the spectrogram and the SPWD, and it will be shown how high‐frequency spectral features may be easily detected from an SPWD but are far less obvious in the spectrogram. Moreover, it will be shown that the SPWD is more appropriate for the analysis of formant structure.
- Session B. Physical Acoustics I: Nonlinear Acoustics
83(1988); http://dx.doi.org/10.1121/1.2025381
This paper describes a ray superposition theory for cumulative growth of nonlinear effects in a two‐dimensional acoustic mode, based on decomposition of the mode into a pair of obliquely propagating, nonlinear planar waves. The mathematical foundation of the formulation is an earlier perturbation analysis of the reflection of a distorted planar wave obliquely incident on the boundary of an infinite half‐space [Z. Qian, Sci. Sin. 25, 492–501 (1982)]. Based on the results of the analysis, each of the pair of rays forming the signal at a selected field point is traced back to its origin at the excitation. Each ray is described as a simple planar wave undergoing finite amplitude distortion that depends on the propagation distance along that ray between field and source points. This distance is the same for each ray at a specified field point, but differences in the excitation at the respective source points result in phase differences between the two rays. The overall signal is shown to be the same as a modal description of the propagation [H. C. Miao and J. H. Ginsberg, J. Acoust. Soc. Am. 80, 911–920 (1986)]. The ray solution explains an apparent paradox in the modal analysis, which indicated that although the signal can be resolved into a pair of planar waves, the distortion process is scaled only by the axial position along the waveguide. Conversely, the earlier solution provides validation for the superposition of rays, as well as for the linear reflection law. [Work supported by NSF and ONR.]
Similarity of a Fourier transform generalization of the Earnshaw solution for planar waves to an interacting wave model for finite amplitude effects in sound beams
83(1988); http://dx.doi.org/10.1121/1.2025382
The Earnshaw solution for a finite amplitude planar wave, which displays amplitude dispersion, is valid for arbitrary excitation f(t) on a boundary. The form that results for small acoustic Mach numbers when f(t) is represented as the inverse of its Fourier transform F(ω) may be considered to be a coordinate straining of the linearized signal, in which the transformation has the appearance of a Fredholm integral equation in the frequency domain. In the case of an acoustic planar wave, the phase speed of all frequency components is the same, independent of frequency. A generalization of that result is obtained if one considers the possibility that the cumulative growth effect is an arbitrary function of distance. Rather than being an abstraction, this form is shown to be analogous to an earlier solution for finite amplitude sound beams [J. H. Ginsberg, H. C. Miao, and M. A. Foda, J. Acoust. Soc. Am. Suppl. 1 81, S25 (1987)]. The signal in that analysis was represented in terms of a spectrum of transverse wavenumbers, rather than frequencies, and the signal for each wavenumber was formed from interacting quasiconical waves, rather than a single planar wave. Nevertheless, the two problems share a common mathematical structure. A discussion of numerical algorithms for such models leads to some surprising observations regarding the implicit functional form of the Earnshaw solution. [Work supported by ONR.]
83(1988); http://dx.doi.org/10.1121/1.2025383
A theory for the nonlinear interaction between two sound beams produced by real sources in a lossless fluid was presented in a previous work [Naze Tjøtta and Tjøtta, J. Acoust. Soc. Am. 83, 487–495 (1988)]. A general solution of the governing equation in the quasilinear approximation, valid at any range, crossing angle, and frequency ratio, was obtained for prescribed boundary conditions. An asymptotic formula for the sum and difference frequency sound pressure was obtained at large distance from the sources. It relates the amplitude and directivity of the sound field in the farfield to the on‐source conditions. In the present work, numerical results are presented for various types of sources (uniform piston, Gaussian source, focusing source). The influence of source geometry (separation, and intersection angles from 0° to 90°) and frequency on the beam pattern of the nonlinearly generated sound is studied. Conditions are also given to determine when scattering of sound by sound can be observed. In the special case of thin Gaussian beams intersecting at a small angle, the results are compared with those presented by Darvennes and Hamilton [J. Acoust. Soc. Am. Suppl. 1 83, S4 (1988)] using the paraxial approximation. [Work supported by the IR&D program of ARL:UT, and VISTA/STATOIL, Norway.]
83(1988); http://dx.doi.org/10.1121/1.2025384
The scattering of sound by sound from Gaussian beams that interact at small angles is investigated theoretically with a quasilinear solution of the Khokhlov‐Zabolotskaya nonlinear parabolic wave equation. The analytical solution, which is valid throughout the entire paraxial field, is a generalization of the result obtained for parametric receiving arrays by Hamilton, Naze Tjøtta, and Tjøtta [J. Acoust. Soc. Am. 82, 311–318 (1987)]. Significant levels of scattered difference frequency sound are shown to exist outside the nonlinear interaction region. An asymptotic formula reveals that difference frequency sound is scattered in the approximate direction of $k_1 - k_2$, where $k_i$ is the wave vector associated with the direction and frequency of the $i$th primary beam. Computed propagation curves and beam patterns demonstrate the dependence of the scattered radiation on source separation, frequency ratio, interaction angle, and focusing. Results are also presented for the scattered sum frequency sound. Comparisons are made with the general asymptotic results presented by Berntsen, Naze Tjøtta, and Tjøtta [J. Acoust. Soc. Am. Suppl. 1 83, S4 (1988)], which are valid for arbitrary interaction angles, source separations, and amplitude distributions. [Work supported by ONR.]
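The direction of the wave-vector difference of the two primary beams can be worked out with a few lines of arithmetic. The sketch below is only a geometric illustration under stated assumptions: each beam is represented by a single plane-wave vector along its axis in the x-z plane, the sound speed of 1500 m/s is a nominal value for water, and the function name is hypothetical. It does not reproduce the asymptotic formula of the abstract.

```python
import math

def difference_direction_deg(f1, f2, angle1_deg, angle2_deg, sound_speed=1500.0):
    """Angle (degrees from the z axis) of k_1 - k_2 for two plane-wave primaries."""
    def wave_vector(freq, angle_deg):
        k = 2 * math.pi * freq / sound_speed       # wavenumber magnitude
        a = math.radians(angle_deg)
        return (k * math.sin(a), k * math.cos(a))  # (x, z) components

    k1 = wave_vector(f1, angle1_deg)
    k2 = wave_vector(f2, angle2_deg)
    return math.degrees(math.atan2(k1[0] - k2[0], k1[1] - k2[1]))
```

For primaries at +5° and −5° with a 2:1 frequency ratio, the returned angle lies well outside the ±5° wedge between the beam axes, consistent with the statement that difference frequency sound appears outside the interaction region.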
Acoustical phase conjugation experiments: The generation of a reversed wave through three‐wave mixing in a layer of stabilized microbubbles
83(1988); http://dx.doi.org/10.1121/1.2025429
A phase‐conjugate mirror is one that reverses an incident wave front so that it propagates back toward the source. Recent experiments [Kustoy et al., Sov. Phys. Acoust. 32, 500–504 (1986)] indicate that wavefront reversal can be established through the interaction of a pump wave of frequency $f_1$ with a probe wave of frequency $f_2 < f_1$ that diverges from a point source. Three‐wave (nonlinear) mixing occurred in a layer of freely rising gas bubbles in water so as to produce a reversed wave having a frequency $f_3 = f_1 - f_2$. The present research yields evidence of reversed wave generation resulting from three‐wave mixing in a layer of stabilized microbubbles. One method of stabilization is to use the gas‐filled micropores of a Nuclepore® polycarbonate membrane. Our experiments were carried out in a water tank of diameter 164 cm and with frequencies $f_1$ and $f_2$ > 300 kHz so as to avoid spurious boundary reflections. The previously predicted [P. L. Marston, J. Acoust. Soc. Am. Suppl. 1 82, 12–13 (1987)] longitudinal and transverse focal point shifts for the reversed wave are investigated. The signal for the reversed wave at frequency $f_3$ is enhanced through the use of a background subtraction technique. [Work supported by ONR.]
83(1988); http://dx.doi.org/10.1121/1.2025430
The propagation of finite amplitude sound waves produced by real sources in an inhomogeneous and thermoviscous fluid is considered. A governing nonlinear equation in the sound pressure amplitude is derived using the methods of singular perturbations. It consistently accounts for the effects of diffraction, dissipation, nonlinearity, and inhomogeneity, and represents a generalization of the parabolic equation valid for a homogeneous fluid (Khokhlov‐Zabolotskaya‐Kuznetsov equation) discussed in a previous work [Naze Tjøtta and Tjøtta, J. Acoust. Soc. Am. 69, 1644–1652 (1980)]. The equation also applies to the case of sound beams produced by strongly curved sources, for example, focusing and defocusing sources. The relationship to the equations of classical ray theory is discussed. [Work supported by The Norwegian Research Council for Sciences and Humanities (NAVF), the IR&D program of ARL:UT, and VISTA/STATOIL, Norway.]
83(1988); http://dx.doi.org/10.1121/1.2025431
A first approximation for identifying foci in weak shocks is to consider the focusing process for a linear discontinuity. This is sometimes known as the “acoustic shock” approximation. A case studied previously is the longitudinal cusp or “arête” located where three rays merge at the focus of a converging cylindrical wave; for finite amplitudes, the wave can become unstable with respect to the formation of a pair of shock‐shocks in the focal region [B. Sturtevant and V. Kulkarny, J. Fluid Mech. 73, 651–671 (1976)]. In the present research, catastrophe theory is used to identify wave‐front shapes $W(x,y)$ that propagate to produce novel acoustic foci with caustics spread out roughly transverse to the direction of propagation [P. L. Marston, J. Acoust. Soc. Am. 81, 226–232 (1987) and Proceedings of the APS 1987 Topical Conference on Shock Waves in Condensed Matter (in press)]. The initial displacement of the wave front from the $xy$ plane is given by $W(x,y)$. Linear waves with [expression missing in source], where the shape parameter $a_2 \neq 0$, propagate to produce transverse cusps described by a cubic cusp curve. A stronger focus, the hyperbolic umbilic, is produced if [expression missing in source] with $a_1 \le 0$. A focal section is produced at a distance $z = -(2a_1)^{-1}$ from the $xy$ plane. In this section, caustic lines form a V with an apex angle of $2\arctan(\gamma/\alpha)^{1/2}$. Issues related to shock stability are noted. [Work supported by ONR.]