Volume 73, Issue S1, May 1983
Index of content:
- PROGRAM OF THE 105TH MEETING OF THE ACOUSTICAL SOCIETY OF AMERICA
- Session TT. Architectural Acoustics V: Criteria Evaluation of Acoustical Spaces in Laboratory and Field Environments
- Contributed Papers
73 (1983); http://dx.doi.org/10.1121/1.2020196
The validity of current physical room acoustic criteria, developed by various authors in the laboratory, has been tested for the first time under the environmental complexity of live concerts. Six independent subjective factors emerged from evaluations on semantic rating scales, namely body, clarity, tonal quality, proximity, spaciousness, and intimacy. In a correlation analysis between subjective factors and measured values of physical room acoustic parameters, the factors clarity and intimacy were not found to be related to any of the physical parameters. The factor body correlated with the 80-ms late-to-early energy ratio, tonal quality with the texture of the impulse response decay, and proximity with the direct-to-reverberant energy ratio and distance from the platform. Each of the factors body, tonal quality, and proximity was also found to reflect an additional objective influence which remained unidentified [A. G. Sotiropoulou, Ph.D. thesis, University of London (1982)]. These results show that there is redundancy in current physical room acoustic criteria, and that the list of physical criteria which account for good acoustics is still incomplete. [Thanks go to Dr. David B. Fleming for invaluable advice.]
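The correlation step described above can be sketched as follows; this is a minimal illustration assuming seat-wise arrays of subjective factor scores and measured physical parameters (all names and shapes are illustrative, not the thesis's data):

```python
import numpy as np

# Pearson correlations between subjective factor scores and physical room
# acoustic parameters, computed across listening positions. An |r| threshold
# (or a significance test) would then flag which factor-parameter pairs are
# related; the study's actual data and thresholds are not reproduced here.
def factor_parameter_correlations(factors: np.ndarray, params: np.ndarray) -> np.ndarray:
    """factors: (n_seats, n_factors); params: (n_seats, n_params).
    Returns an (n_factors, n_params) matrix of Pearson r values."""
    n_f = factors.shape[1]
    r = np.corrcoef(factors.T, params.T)  # joint (n_f + n_p) x (n_f + n_p) matrix
    return r[:n_f, n_f:]                  # cross block: factors vs parameters
```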
Sound isolation rating system for use when the sound source is music, machinery, or mechanical equipment
73 (1983); http://dx.doi.org/10.1121/1.2020197
A rating system is presented for use in the selection of sound isolating components when the sound source is music, machinery, or mechanical equipment. The rating system described is complementary to the STC rating system, which works well as a tool for simplifying the selection of components for the isolation of “speech-like” sound sources. By considering both rating numbers, it is possible to compare and evaluate different construction systems more accurately, and yet simply, as to their appropriateness for the intended use.
73 (1983); http://dx.doi.org/10.1121/1.2020198
The low-frequency sound field in nonrectangular enclosures was examined by using the finite element method. Five rooms were examined, one of which was rectangular and was used as a benchmark for comparison. The other four rooms were perturbations of the Louden room. Progressively more complicated changes were made in order to assess the effect of a change in boundary shape on the low-frequency sampling statistics. All of the rooms had parallel floors and ceilings to represent a real-life constraint. The source- and receiver-averaged response fluctuation statistics for the three lowest one-third octave bands defined in the narrow-band qualification standard (ISO 3742) were computed for each room. Further, the receiver-averaged response was calculated in order to determine the effect of source position. The statistics were tabulated for twelve source positions in each room for each one-third octave band considered. The results of this study show that the low-frequency sampling statistics of rectangular reverberation chambers can be improved by relatively simple boundary modifications.
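A hedged sketch of the averaging just described, assuming the finite element solution yields a grid of band levels L[s, r] for source s and receiver r (energy averaging before conversion back to dB is my assumption):

```python
import numpy as np

def receiver_averaged_db(L: np.ndarray) -> np.ndarray:
    """L: (n_sources, n_receivers) band SPL in dB.
    Returns the receiver-averaged response per source, in dB."""
    return 10.0 * np.log10(np.mean(10.0 ** (L / 10.0), axis=1))

def fluctuation_stats(L: np.ndarray):
    """Mean and spread of the receiver-averaged response over source
    positions, e.g., the twelve positions tabulated per room and band."""
    per_source = receiver_averaged_db(L)
    return float(per_source.mean()), float(per_source.std())
```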
73 (1983); http://dx.doi.org/10.1121/1.2020199
New construction of multifamily living units often includes noise insulation performance testing. The degree and source of deviation of field test (FSTC and FIIC) values from those of the prototype laboratory specimen are generally not predictable. Field testing according to ASTM E336-77 (FSTC) and ISO 140/VII-78 (FIIC) was performed. In this test series, flanking and the site background noise were controlled, and octave-band data were used to reduce test time [A. J. Campanella, J. Acoust. Soc. Am. Suppl. 1 69, S8 (1981)]. Testing was conducted in partially completed units including wood framing, windows, doors, wiring, and limited HVAC ducting. Wall FSTC tests generally agreed with laboratory tests, but the wood floors of rooms of reduced dimensions exhibited serious FSTC and impact sound level deficiencies amounting to as much as 10 to 14 dB in the 125-Hz octave band. Additional 63-Hz octave-band data indicated that the floor/ceiling structures exhibited low TL values in either the 63- or 125-Hz octave bands. This phenomenon was linked to bar-like resonance in short joist lengths (8 to 10 ft vs 14 ft in the lab test). Accelerometer measurements on exposed joists showed significant joist vibration in the free-end mode for the 125-Hz band under impact excitation. A layer of gypsum board was added to the subfloor sandwich to provide mass loading and added damping of the free-bar joist vibrational mode, so that the 125-Hz FTL became greater than that of the 63-Hz octave band. Acceptable FSTC values were achieved by additionally placing the ceiling gypsum board on separate 2×4 subjoists with an added layer of R-11 insulation.
73 (1983); http://dx.doi.org/10.1121/1.2020200
A suburban school district experimented with painting acoustic ceilings in a 25-year-old elementary school building in an attempt to improve the appearance of the deteriorating tiles. Several ceilings were painted using a latex primer and acrylic semigloss paint. Following complaints by teachers, the painting program was suspended, and other approaches were tried to increase absorption in rooms with painted ceilings and/or decrease the effect of painting in rooms still to be painted. Measurements of reverberation were made for each condition in an effort to quantify the effect of painting acoustic ceilings.
73 (1983); http://dx.doi.org/10.1121/1.2020201
Almost all closed-form solutions of problems in duct acoustics (and in other branches of acoustics) are obtained by assuming that the eigenfunctions depend on each of the coordinates independently of the others. Such “separable” solutions are impossible if, for example, the boundary conditions are nonuniform. Under these conditions, instead of resorting to ad hoc numerical procedures, closed-form solutions can still be obtained by using nonseparable modes. Examples of circumferentially and/or axially nonuniform boundary conditions are worked out to illustrate this procedure.
- Session UU. Speech Communication VIII: Recognition and Intelligibility
73 (1983); http://dx.doi.org/10.1121/1.2020202
A field study was initiated to learn about the effects of various telephone transmission and switching conditions on the algorithms currently used in the Bell Laboratories LPC-based isolated word recognizer. Digit recordings were obtained from customers over a variety of transmission facilities. During a 23-day recording period a total of 11 035 isolated digits were recorded. For each recording, statistics were collected about the line condition, the background environment, and the customer's ability to speak his telephone number as a sequence of isolated digits. Also recorded was information about the ability of the automatic word endpoint detector to find each spoken digit and to accurately determine the correct endpoints. The results of several recognition tests are presented: one in which a previously defined set of laboratory-created digit reference templates was used, and several others where new sets of reference templates were created from a subset of the recorded digits. It is shown that the performance of the recognizer is poor (average digit accuracy of 77.4%) using the laboratory template set, but improves substantially (average digit accuracy of 93.1%) for a template set created from the field recordings. An explanation of the reasons for this improvement in digit recognition accuracy is presented, along with its implications for future work in isolated word recognition.
73 (1983); http://dx.doi.org/10.1121/1.2020203
Currently, a test utterance is recognized as word “A” when the distance score of the test utterance to a reference template for word “A” is the lowest among the measured scores between the test utterance and all reference templates. The reference template is generated by (1) clustering utterances from many speakers and (2) choosing the cluster centers as reference templates. Thus, the distance between the test utterance and the reference template is the distance of the test utterance to the cluster center. This type of distance measure ignores the fact that clusters have different sizes. Using a Gaussian model for the distribution of utterances inside a cluster, this paper proposes a new distance measure based on the size of the cluster. The correlation between the Itakura distance measure and the new distance measure for many utterances (∼1500) is fairly low (∼0.2). This suggests that the new distance measure carries new information. Incorporating this new distance measure can cut the recognition error rate in half.
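A minimal sketch of a cluster-size-aware distance of this kind: under a Gaussian model it is natural to normalize the test-to-center distance by the cluster's spread. The scalar-sigma form and all names here are assumptions, since the abstract does not give the exact formulation:

```python
# Each reference template stores its center-to-test distance score plus a
# per-cluster spread (e.g., the standard deviation of member-to-center
# distances estimated during clustering).
def cluster_normalized_distance(d_center: float, cluster_sigma: float) -> float:
    # Under a Gaussian model, the same raw distance is less significant in a
    # broad cluster than in a tight one, so normalize by the spread.
    return d_center / cluster_sigma

# Usage: pick the word whose normalized distance is smallest.
raw_scores = {"A": 0.42, "B": 0.38}   # e.g., Itakura distances to cluster centers
spreads = {"A": 0.30, "B": 0.15}      # per-cluster sigma
best = min(raw_scores, key=lambda w: cluster_normalized_distance(raw_scores[w], spreads[w]))
print(best)  # "A": the broader cluster wins despite its larger raw distance
```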
73 (1983); http://dx.doi.org/10.1121/1.2020204
A conversational-mode connected speech recognition system which simulates the function of an airline ticket agent is described. The system, which allows a complete spoken dialog over dialed-up telephone lines, comprises a syntax-directed dynamic programming temporal alignment algorithm [C. S. Myers and S. E. Levinson, IEEE Trans. Acoust. Speech Signal Process. ASSP-30, 561–565 (1982)] for acoustic pattern recognition and a semantic processor [S. E. Levinson and K. L. Shipley, Bell Syst. Tech. J. 59, 119–137 (1980)] that controls an audio response system, making two-way speech communication possible. The system is highly robust and operates on-line in ten times real time in speaker-trained mode (or 60 times in speaker-independent mode) on a 32-bit minicomputer/array processor combination. The heterarchical nature of the system allows it some intelligent processing, such as the ability to recognize incorrect, incomplete, improbable, and/or conflicting information in the context of the transaction and to pose questions which identify the difficulty, thereby allowing communication to proceed in a natural way.
73 (1983); http://dx.doi.org/10.1121/1.2020205
Endpoint detection is a critical issue for several types of isolated utterance recognizers, because improper endpoints often result in recognition errors. Endpoint errors often stem from nonspeech artifacts, namely lip smacks, tongue and teeth clicks, and breath noise. Endpoint detectors based only on energy thresholds cannot correctly reject these artifacts, but adding a word model allows most of these artifacts to be properly rejected. The rules which implement the word model are (1) the word cannot begin or end with two released plosives, (2) word initial stop gaps are less than 120 ms and word final ones less than 200 ms, (3) a word must contain a vocalic nucleus and be at least 100 ms in length, (4) word final sounds containing only mid‐frequency energy are breath noise. The detection algorithm has been implemented on a Heuristics Speech Recognizer and tested using the Texas Instruments isolated word data base. The word model based system substantially reduced the error rate relative to an energy threshold based system.
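The four word-model rules translate naturally into a rejection test; the sketch below is one hypothetical encoding (the Candidate fields are illustrative, not the recognizer's actual data structures):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    duration_ms: float
    has_vocalic_nucleus: bool
    initial_stop_gap_ms: float       # silence preceding an initial release
    final_stop_gap_ms: float         # silence preceding a final release
    begins_with_two_plosives: bool
    ends_with_two_plosives: bool
    final_energy_midband_only: bool  # breath-noise signature (rule 4)

def accept_word(c: Candidate) -> bool:
    if c.begins_with_two_plosives or c.ends_with_two_plosives:      # rule 1
        return False
    if c.initial_stop_gap_ms >= 120 or c.final_stop_gap_ms >= 200:  # rule 2
        return False
    if not c.has_vocalic_nucleus or c.duration_ms < 100:            # rule 3
        return False
    if c.final_energy_midband_only:                                 # rule 4
        return False
    return True
```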
73 (1983); http://dx.doi.org/10.1121/1.2020206
Recent experiments [M. Bush et al., Proc. 1983 IEEE ICASSP] have demonstrated the ability of a trained spectrogram reader to identify initial stops in /CVb/ syllables from a table of numerical acoustic measurements with approximately 80% accuracy. This paper discusses an automatic system for discriminating between the voiceless plosives (/p,t,k/) which is based on the features and rules identified in these experiments. Ten binary features are extracted from two linear prediction spectra which are computed during the 35 ms following the consonant release. Typical features include “back-k-release-spectrum” and “compact-release-spectrum.” The features are detected by examining the frequencies and amplitudes of the local maxima and minima of the two LPC spectra, in a manner motivated by the actions of the human spectrogram reader. A simple statistical classifier is used to combine the outputs of the ten feature detectors. The classifier was trained on the 108 /p,t,k/ tokens of the multi-speaker corpus used in the spectrogram reading experiment. When tested on the training data, the system achieved 96% correct recognition. When tested on two additional data sets of similar composition, the system achieved scores of 94% and 92%, respectively.
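The abstract does not name the “simple statistical classifier”; the sketch below realizes one plausible choice, a naive Bayes combination of the ten binary feature detectors (the choice of naive Bayes is my assumption):

```python
import numpy as np

def train(features: np.ndarray, labels: np.ndarray, n_classes: int = 3):
    """features: (n_tokens, 10) binary matrix; labels: 0=/p/, 1=/t/, 2=/k/."""
    priors = np.array([(labels == c).mean() for c in range(n_classes)])
    # P(feature_j = 1 | class c), Laplace-smoothed to avoid zero probabilities
    likes = np.array([(features[labels == c].sum(0) + 1.0) /
                      ((labels == c).sum() + 2.0) for c in range(n_classes)])
    return priors, likes

def classify(x: np.ndarray, priors, likes) -> int:
    """x: length-10 binary feature vector for one token."""
    logp = np.log(priors) + (np.log(likes) * x + np.log(1.0 - likes) * (1 - x)).sum(axis=1)
    return int(np.argmax(logp))
```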
73 (1983); http://dx.doi.org/10.1121/1.2020207
Ten male talkers who were raised in Southern California were recorded with a high-quality dynamic microphone while making survey calls on public attitudes toward crime. A week later they were recorded again. A recording of one of the talkers made during the first recording session was played to the listeners. After delays of either one or two weeks the listeners returned and heard the second recordings of all ten talkers. The listeners were told that the voice of the talker they heard the first time might appear once, more than once, or not at all. Listeners were asked to say whether each voice they heard was the voice they heard in the first listening session, and how confident they were in their decision. If they decided a voice was not the voice they had heard earlier, they were asked to indicate how similar it was to that voice. Signal detection analysis showed that listeners maintained an essentially stable hit rate at the cost of an increasing false alarm rate at the two-week as opposed to the one-week interval.
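A brief sketch of the signal detection computation behind that conclusion, using the usual d' = z(hit rate) - z(false-alarm rate); the counts below are illustrative, not the study's data:

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    hr = hits / (hits + misses)
    far = false_alarms / (false_alarms + correct_rejections)
    z = NormalDist().inv_cdf
    return z(hr) - z(far)

# A stable hit rate with a rising false-alarm rate lowers sensitivity:
print(d_prime(8, 2, 2, 8))  # one-week interval (illustrative counts): ~1.68
print(d_prime(8, 2, 4, 6))  # two-week interval (illustrative counts): ~1.10
```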
73 (1983); http://dx.doi.org/10.1121/1.2020208
An initial step in wide vocabulary, continuous speech recognition is proposed that roughly consists of a schematization of the features seen in conventional spectrograms—e.g., peaks and edges of spectral energy concentrations, temporal discontinuities, and spectral balance information. In a second step, these features can be mapped onto their acoustic‐phonetic correlates—e.g., formant distribution, voice onsets, articulatory closures. A method for locating spectral energy concentrations is given that takes advantage of their usual continuity in time, and thus performs superior to locating peaks in spectral cross sections. It begins with smoothing and flattening convolutions in both the time and frequency dimensions of narrow‐band spectrograms to select the appropriate temporal and spectral scales. Ridges in the resulting two‐dimensional (time‐frequency) surfaces correspond to local spectral energy concentrations. The tops of these ridges are found by the application of a two‐dimensional differential operator at each point in the time‐frequency plane. The operator's definition in terms of the relationship between the gradient and principal directions will be given, along with justification and examples.
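One common realization of such a ridge-top operator uses the gradient and the principal curvature directions of the smoothed time-frequency surface; this sketch is an assumption on my part, since the abstract only summarizes the paper's operator:

```python
import numpy as np

def ridge_mask(S: np.ndarray, eps: float = 1e-3) -> np.ndarray:
    """S: smoothed time-frequency surface. Marks points where the surface has
    strong negative curvature and the gradient vanishes along the principal
    (most negatively curved) direction, i.e., ridge tops."""
    gy, gx = np.gradient(S)
    gyy, gyx = np.gradient(gy)
    gxy, gxx = np.gradient(gx)
    mask = np.zeros_like(S, dtype=bool)
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            H = np.array([[gyy[i, j], gyx[i, j]],
                          [gxy[i, j], gxx[i, j]]])   # local Hessian
            w, v = np.linalg.eigh(H)                 # eigenvalues ascending
            principal = v[:, 0]                      # most-negative-curvature direction
            g = np.array([gy[i, j], gx[i, j]])
            mask[i, j] = (w[0] < 0) and (abs(g @ principal) < eps)
    return mask
```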
73 (1983); http://dx.doi.org/10.1121/1.2020209
Two groups of subjects were presented with spectrograms of 50 words they had never seen before and were asked to provide a single monosyllabic English word for each spectrogram. One group had learned to identify a limited set of speech spectrograms after 15 h of training using a study-test procedure which stressed holistic word identification. A subgroup of participants in a formal course on spectrogram reading at MIT served as the second group of subjects. These subjects learned specific acoustic and phonetic principles and strategies for interpreting spectrograms. Subjects in the first group correctly identified 33% of the possible phonetic segments in spectrograms they had never seen before. The second group of subjects correctly identified 40% of the possible segments in the same set of spectrograms given to the first group. When the data were scored for correct manner class of consonants only, the two groups did not differ significantly. Detailed descriptions of the identification results will be presented. Implications of these findings for developing visual aids for hearing-impaired persons and improved phonetic recognition strategies will be discussed. [Supported by NSF.]
73 (1983); http://dx.doi.org/10.1121/1.2020210
Design criteria for a visual speech display system for the deaf are being investigated. Of particular interest is reading speed for displays of the type that could be produced by an automatic speech recognition device. An adaptive procedure has been developed for measuring reading speed at a prescribed level of performance (e.g., 70% of displayed sentences interpreted correctly). This technique has been used in measuring reading speed as a function of sentence length (3, 6, or 12 words), display format (stationary, flowing, accumulative), and mode of representation (standard English text, phonetically based symbols). Extremely high reading speeds (>500 words/min) were obtained using standard English text and very short sentences. Reading speeds comparable to normal speech rates were obtained with a simple, phonetically based representation and moderate sentence lengths (e.g., 6 words/sentence). Experiments are currently in progress on the effects on reading speed of eliminating word boundaries and adding stress markers. [Research supported by PHS Grant #PO1-NS 17764.]
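One way to realize such an adaptive procedure is a weighted up-down staircase on display rate; the step sizes below are assumptions chosen so the track converges near 70% correct, since the abstract does not specify the rule:

```python
def adapt_rate(rate_wpm: float, correct: bool,
               step_up: float = 30.0, step_down: float = 70.0) -> float:
    """Raise the display rate after a correct sentence, lower it after an
    error. The track is stationary when p*step_up = (1-p)*step_down,
    i.e., p = step_down / (step_up + step_down) = 0.70 here."""
    return rate_wpm + step_up if correct else rate_wpm - step_down

# Illustrative run: home in on the rate yielding ~70% correct sentences.
rate = 200.0
for outcome in [True, True, False, True, False, True]:
    rate = adapt_rate(rate, outcome)
```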
73 (1983); http://dx.doi.org/10.1121/1.2020211
The band-importance function developed by Bell Laboratories is not directly applicable to any of the available recorded speech tests. For this reason we developed a test which is like that used by the Bell Laboratories but unlike other available tests in that (1) each CV or VC is a different utterance and (2) the phonemes appear with uniform frequency. The same phonemes used by the Bell Laboratories were used. A “test” consists of utterances by one male and one female talker, each speaking 152 CVs and VCs following a carrier phrase. The frequency spectrum of this material, as well as the procedure used for its estimate, will be described. Practice effects were studied at different magnitudes of distortion. The importance function which is presumably applicable to this material was found valid, within the error of measurement, for the filtered speech articulation test results. The error in predicting the articulation score will be related to the magnitude of the intrinsic variability of articulation testing. [Work supported by NINCDS.]
73 (1983); http://dx.doi.org/10.1121/1.2020212
The implications of various assumptions utilized in different approaches to articulation prediction were tested under several filtered-speech, interfering-noise, and low-level listening conditions. The monotonicity principle, basic to AI theory, was examined and found valid when certain plausible adjustments were made in the speech signal and auditory sensitivity levels. The possibility of different intelligibility-articulation index transfer functions for amplitude distortion and noise interference was evaluated. An analysis of prediction results obtained by using different calculation techniques revealed that they did not agree. The band independency principle was investigated, both as related to the general validity of the AI concept and as related to the test materials and calculation procedures used. Variations of certain assumptions within a given calculation technique (e.g., real peaks versus 12 dB across all frequencies) were found to have only minor effects on the accuracy of predictions. [Work supported by NINCDS.]
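For reference, the basic AI calculation under test has the classic form AI = Σᵢ wᵢAᵢ, with band-importance weights wᵢ and band audibilities Aᵢ clipped to a 30-dB range; the sketch below uses placeholder band values, not the paper's data:

```python
def articulation_index(importance, speech_peaks_db, noise_db):
    """importance: band weights summing to 1; speech_peaks_db / noise_db:
    per-band speech peak and effective masking levels."""
    ai = 0.0
    for w, s, n in zip(importance, speech_peaks_db, noise_db):
        audibility = min(max(s - n, 0.0), 30.0) / 30.0  # fraction of the 30-dB range
        ai += w * audibility
    return ai

# Illustrative 5-band example:
print(articulation_index([0.10, 0.20, 0.30, 0.25, 0.15],
                         [60, 62, 58, 50, 45],
                         [40, 45, 50, 48, 44]))
```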
Comparison of interconsonantal differences and initial-syllable error responses for words heard in quiet and noise
73 (1983); http://dx.doi.org/10.1121/1.2020213
Consonants heard incorrectly are replaced in error by speech sounds that are similar to the target sound. The Interconsonantal Difference (ID) [J. W. Black, Report 17, Proj. 2928, The Ohio State University Research Foundation, Columbus, OH, 1974] offers a method by which errors in aurally perceived, graphically transcribed words may be compared. A sample of 6000 words was recorded and reproduced to listeners under conditions of “quiet” and “noise,” yielding 250 000 errors. The frequency of error occurrence was correlated with the respective ID for each of 54 consonants or consonantal clusters. Results indicated high correlation values where error phonemes were associated with front placement, voicing, and plosiveness in “quiet.” The addition of “noise” resulted in an increase in correlation values, despite the larger overall number of errors made in transcription.
73 (1983); http://dx.doi.org/10.1121/1.2020214
It is common knowledge that the bulk of the intelligibility-relevant information in English speech is carried by the consonants, which are generally of lower intensity and shorter duration than the vowels. This led us to hypothesize that, except for its effect on voicing information, whispered speech would permit more efficient use of communication channel space than would normal speech. The Diagnostic Rhyme Test (DRT) was used to test this hypothesis with noise-masked speech. Under extreme S/N conditions, normal speech yielded higher DRT total scores than did whispered speech, but, over the range from −6 to +9 dB S/N ratio, whispered speech yielded significantly higher scores. When the contribution of the voicing scale to total DRT scores was removed, whispered speech yielded uniformly higher scores. The DRT score for voicing was predictably the most affected by whispering. However, significantly greater than chance amounts of voicing information survived all but the least favorable conditions. Noise caused a substantial loss of information with respect to several features in normal speech, while having negligible effects in the case of whispered speech.
73 (1983); http://dx.doi.org/10.1121/1.2020215
Consonant errors in noise and reverberation were compared. In CVC syllables, the variables were 24 initial or 21 final consonants that exist in English words, while the constant consonant was /t/ and the vowels were /ɑ/ and /i/. All syllables were produced by a male talker in a carrier sentence: “Mark the /CVC/ again.” The tests were either processed through a room with a reverberation time of T = 1.2 s or mixed with a babble of 20 voices. The following conditions were tested: (1) no noise and no reverberation; (2) no noise, T = 1.2 s; and (3) no reverberation, noise at a speech-to-noise ratio of S/N = +5 dB. Each condition was tested 20 times by 10 normal-hearing subjects. The responses were tabulated in 12 matrices: 2 vowels × 2 consonant positions × 3 test conditions. Transmitted information, TI, was calculated for selected distinctive features for each matrix separately. Analysis of variance was performed for each distinctive feature. Many main effects and two-way interactions were significant. Differences in errors in noise and reverberation will be discussed. [Supported by NIH.]
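Transmitted information for a distinctive feature follows the standard Miller-Nicely mutual-information computation on the feature's confusion matrix; a minimal sketch of that conventional form (the abstract itself does not spell out the formula):

```python
import numpy as np

def transmitted_information(confusions: np.ndarray) -> float:
    """confusions[i, j] = count of stimulus category i heard as response j.
    Returns the transmitted information in bits."""
    p = confusions / confusions.sum()
    pi = p.sum(axis=1, keepdims=True)  # stimulus marginals
    pj = p.sum(axis=0, keepdims=True)  # response marginals
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (pi @ pj)[nz])).sum())

# Perfect transmission of a binary feature carries 1 bit:
print(transmitted_information(np.array([[10.0, 0.0], [0.0, 10.0]])))  # 1.0
```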