Index of content:
Volume 136, Issue 2, August 2014
- SPEECH PROCESSING AND COMMUNICATION SYSTEMS 
136(2014); http://dx.doi.org/10.1121/1.4887479
An alternative to the spectral overlap assessment metric (SOAM), first introduced by Wassink [(2006). J. Acoust. Soc. Am. 119(4), 2334–2350], is proposed. The SOAM quantifies intra- and inter-language differences between long–short vowel pairs by comparing spectral (F1, F2) and temporal properties modeled with best-fit ellipses (F1 × F2 space) and ellipsoids (F1 × F2 × duration). However, the SOAM ellipses and ellipsoids rely on normally distributed vowel data and a dense dataset, neither of which can be guaranteed for endangered languages or languages with limited available data. The method presented in this paper, called the Vowel Overlap Assessment with Convex Hulls (VOACH) method, improves on the earlier metric by using best-fit convex shapes. The VOACH method reduces the incorporation of “empty” regions, which contain no measured tokens, into calculations of vowel space. Both methods are applied to Numu (Oregon Northern Paiute), an endangered language of the western United States. Calculations from the VOACH method suggest that Numu is a primary quantity language, a result that aligns well with impressionistic analyses of spectral and durational data from the language and with observations by field researchers.
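To make the convex-hull idea concrete, the following is a minimal two-dimensional sketch, not the VOACH formula itself. It assumes each vowel category is a set of (F2, F1) token measurements, builds each category's convex hull with Andrew's monotone chain, intersects the two hulls with Sutherland–Hodgman clipping, and reports a hypothetical overlap score: the fraction of vowel A's hull area shared with vowel B's hull. The function names and the score definition are illustrative assumptions, and the F1 × F2 × duration (3-D) case is not covered.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # one vowel token, e.g. (F2, F1) in Hz


def cross(o: Point, a: Point, b: Point) -> float:
    """z-component of (a - o) x (b - o); positive when b lies left of o->a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: Sequence[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(poly: Sequence[Point]) -> float:
    """Shoelace formula; 0.0 for degenerate polygons."""
    if len(poly) < 3:
        return 0.0
    m = len(poly)
    s = sum(poly[i][0] * poly[(i + 1) % m][1] - poly[(i + 1) % m][0] * poly[i][1]
            for i in range(m))
    return abs(s) / 2.0


def _lerp(p: Point, q: Point, t: float) -> Point:
    return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))


def clip_convex(subject: Sequence[Point], clipper: Sequence[Point]) -> List[Point]:
    """Sutherland-Hodgman: intersection of `subject` with a convex CCW `clipper`."""
    output = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        inp, output = output, []
        for j in range(len(inp)):
            prev, cur = inp[j - 1], inp[j]
            c_prev, c_cur = cross(a, b, prev), cross(a, b, cur)
            if c_cur >= 0:                     # cur is inside this clip edge
                if c_prev < 0:                 # entering: add the crossing point
                    output.append(_lerp(prev, cur, c_prev / (c_prev - c_cur)))
                output.append(cur)
            elif c_prev >= 0:                  # leaving: add the crossing point
                output.append(_lerp(prev, cur, c_prev / (c_prev - c_cur)))
    return output


def hull_overlap(tokens_a: Sequence[Point], tokens_b: Sequence[Point]) -> float:
    """Hypothetical score: share of vowel A's hull area covered by vowel B's hull."""
    hull_a, hull_b = convex_hull(tokens_a), convex_hull(tokens_b)
    area_a = polygon_area(hull_a)
    if area_a == 0.0 or len(hull_b) < 3:
        return 0.0
    return polygon_area(clip_convex(hull_a, hull_b)) / area_a
```

Unlike a fitted ellipse, a hull never extends beyond the outermost measured tokens, which is the sense in which hull-based areas avoid counting regions with no data. The trade-off is sensitivity to outlying tokens, since a single outlier can stretch the hull.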
136(2014); http://dx.doi.org/10.1121/1.4884759
This study proposes an approach to improve the perceptual quality of speech separated by binary masking by reconstructing it in the time-frequency domain. Non-negative matrix factorization and sparse reconstruction approaches are investigated; both represent a signal as a linear combination of basis vectors. In this approach, the short-time Fourier transform (STFT) of the separated speech is represented as a linear combination of STFTs from a clean speech dictionary. Binary masking for separation is performed using deep neural networks or Bayesian classifiers. The perceptual evaluation of speech quality (PESQ), a standard objective speech quality measure, is used to evaluate the performance of the proposed approach. The results show that the proposed techniques improve the perceptual quality of binary-masked speech and outperform traditional time-frequency reconstruction approaches.
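The dictionary-based reconstruction step can be sketched for a single STFT magnitude frame. This is a minimal illustration under stated assumptions, not the authors' algorithm: the clean-speech dictionary `W` is held fixed, non-negative activations `h` are fitted only on the time-frequency bins the binary mask kept (weighted Euclidean cost, Lee–Seung style multiplicative updates), and the masked-out bins are then filled in from `W h`. The helper names, the cost function, and the toy dictionary values are hypothetical, and phase handling is omitted.

```python
from typing import List, Sequence

Matrix = List[List[float]]  # rows = frequency bins, columns = dictionary atoms


def masked_activations(W: Matrix, v: Sequence[float], mask: Sequence[int],
                       n_iter: int = 200, eps: float = 1e-12) -> List[float]:
    """Non-negative h minimizing sum_f mask[f] * (v[f] - (W h)[f])**2.

    Dictionary W stays fixed; only bins with mask[f] == 1 drive the fit.
    """
    n_bins, n_atoms = len(W), len(W[0])
    h = [1.0] * n_atoms
    # Numerator W^T (mask * v) is constant across iterations.
    num = [sum(W[f][k] * mask[f] * v[f] for f in range(n_bins))
           for k in range(n_atoms)]
    for _ in range(n_iter):
        wh = [sum(W[f][k] * h[k] for k in range(n_atoms)) for f in range(n_bins)]
        # Denominator W^T (mask * (W h)); eps avoids division by zero.
        den = [sum(W[f][k] * mask[f] * wh[f] for f in range(n_bins))
               for k in range(n_atoms)]
        h = [h[k] * num[k] / (den[k] + eps) for k in range(n_atoms)]
    return h


def reconstruct_frame(W: Matrix, h: Sequence[float]) -> List[float]:
    """W h: every bin, including masked-out ones, rebuilt from clean-speech atoms."""
    return [sum(W[f][k] * h[k] for k in range(len(h))) for f in range(len(W))]


# Toy dictionary: 4 frequency bins x 2 atoms (made-up numbers, not speech data).
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
mask = [1, 1, 0, 1]            # bin 2 was removed by the binary mask
masked = [2.0, 1.0, 0.0, 4.0]  # magnitude frame after masking
h = masked_activations(W, masked, mask)
print(reconstruct_frame(W, h))  # ≈ [2.0, 1.0, 3.0, 4.0]; bin 2 is filled in
```

Because the updates only multiply non-negative quantities, `h` stays non-negative throughout, which is what keeps the reconstruction a non-negative combination of dictionary atoms.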
A nonuniform sampling technique based on inflection point detection and its application to speech coding
136(2014); http://dx.doi.org/10.1121/1.4884882
To reduce the amount of data, the nonuniform sampling (NUS) method samples a signal only at selected points, such as local maxima and minima. To overcome the sparseness problem of the NUS method, an inflection point detection (IPD) method is proposed to sample a signal nonuniformly. The IPD samples a signal not only at the local maxima and minima but also at the inflection points, where the slope of the signal changes. To show its usefulness, the IPD is applied to speech coding. The encoder transmits the time instants and sample amplitudes of the inflection points. At the receiver, the decoder estimates the sample amplitudes at the noninflection points by interpolating the received information. Simulation results show that the IPD method yields a 7% improvement in mean square error over the NUS method. With a small threshold for detecting inflection points, the proposed coding method improves the signal-to-noise ratio (SNR) by 0.38–8.72 dB and the mean opinion score by 0.5–1.3 compared with the continuously variable slope delta modulation (CVSDM) algorithm. The IPD method achieves up to 8.5 dB higher SNR than the CVSDM at bit error rates (BER) below 5 × 10⁻⁵, but performs worse than the CVSDM at BER above 5 × 10⁻⁵.
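The encoder/decoder split described above can be sketched on a sampled signal. This is a minimal illustration, not the paper's coder. It assumes that an "inflection point" is a sample where the discrete second difference changes sign, with a hypothetical `threshold` on the jump in that difference, and that the decoder uses piecewise-linear interpolation. The abstract specifies neither the exact detection rule nor the interpolator. Endpoints are always kept so the decoder can cover the whole frame, and quantization and bitstream formatting are omitted.

```python
import math
from typing import List, Sequence


def detect_samples(x: Sequence[float], use_inflections: bool = True,
                   threshold: float = 0.0) -> List[int]:
    """Indices to transmit: endpoints, local extrema (NUS), and, if enabled,
    points where the discrete curvature changes sign (assumed IPD rule)."""
    n_samples = len(x)
    keep = {0, n_samples - 1}
    # Local maxima/minima: the first difference changes sign at n.
    for n in range(1, n_samples - 1):
        if (x[n] - x[n - 1]) * (x[n + 1] - x[n]) < 0:
            keep.add(n)
    if use_inflections:
        # Discrete second difference, defined for n = 1 .. N-2.
        curv = {n: x[n + 1] - 2 * x[n] + x[n - 1] for n in range(1, n_samples - 1)}
        for n in range(2, n_samples - 1):
            if curv[n - 1] * curv[n] < 0 and abs(curv[n] - curv[n - 1]) > threshold:
                keep.add(n)
    return sorted(keep)


def interpolate(indices: Sequence[int], values: Sequence[float],
                length: int) -> List[float]:
    """Decoder sketch: piecewise-linear interpolation between received samples."""
    out: List[float] = []
    seg = 0
    for n in range(length):
        # Advance to the segment [indices[seg], indices[seg + 1]] that contains n.
        while seg < len(indices) - 2 and n > indices[seg + 1]:
            seg += 1
        i0, i1 = indices[seg], indices[seg + 1]
        t = (n - i0) / (i1 - i0)
        out.append(values[seg] + t * (values[seg + 1] - values[seg]))
    return out


# Toy usage: a phase-shifted sine with 40 samples per period.
x = [math.sin(2 * math.pi * n / 40 + 0.3) for n in range(80)]
nus_idx = detect_samples(x, use_inflections=False)  # extrema + endpoints only
ipd_idx = detect_samples(x)                          # also keeps inflection points
decoded = interpolate(ipd_idx, [x[i] for i in ipd_idx], len(x))
print(len(nus_idx), "NUS samples vs", len(ipd_idx), "IPD samples")
```

Raising `threshold` discards weak curvature changes and therefore trades reconstruction accuracy for fewer transmitted samples. This mirrors, in spirit, the role of the small detection threshold mentioned in the abstract.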