Volume 113, Issue 4, April 2003
Index of content:
- SPEECH PROCESSING AND COMMUNICATION SYSTEMS 
113(2003); http://dx.doi.org/10.1121/1.1558356
The study investigated the segmental intelligibility of four currently available text-to-speech (TTS) products at signal-to-noise ratios (SNRs) of 0 dB and 5 dB. The products were IBM ViaVoice™ version 5.1, which uses formant coding; Festival version 1.4.2, a diphone-based LPC TTS product; AT&T Next-Gen™, a half-phone-based TTS product that uses the harmonic-plus-noise method for synthesis; and FlexVoice™2, a hybrid TTS product that combines concatenative and formant coding techniques. Overall, concatenative techniques were more intelligible than formant or hybrid techniques, with formant coding slightly better at modeling vowels and concatenative techniques marginally better at synthesizing consonants. No TTS product resisted noise interference better than the others, although all were more intelligible at 5-dB than at 0-dB SNR. The better TTS products in this study were, on average, 22% less intelligible and had about three times more phoneme errors than the human voice under comparable listening conditions. The hybrid TTS technology of FlexVoice had the lowest intelligibility and the highest error rates. There were discernible patterns of errors for stops, fricatives, and nasals. Unrestricted TTS output (e-mail messages, news reports, and so on) under the high-noise conditions prevalent in automobiles, airports, etc. will likely challenge listeners.
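For readers unfamiliar with the two listening conditions, the sketch below illustrates what 0-dB and 5-dB SNR mean in terms of signal power. It is only an illustration: the abstract does not say how the noise was generated or mixed, so the function names (`mean_power`, `noise_gain_for_snr`, `mix_at_snr`, `measure_snr_db`) and the power-based SNR definition over the whole signal are assumptions, not the authors' procedure.

```python
import math

def mean_power(samples):
    """Average power of a signal: the mean of its squared samples."""
    return sum(s * s for s in samples) / len(samples)

def noise_gain_for_snr(speech, noise, snr_db):
    """Gain to apply to `noise` so the speech-to-noise power ratio is snr_db.

    Assumes SNR is defined as 10*log10(P_speech / P_noise) over the
    whole signal (a hypothetical choice; the study may have used another).
    """
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    return math.sqrt(target_noise_power / mean_power(noise))

def mix_at_snr(speech, noise, snr_db):
    """Add scaled noise to speech, sample by sample, at the requested SNR."""
    gain = noise_gain_for_snr(speech, noise, snr_db)
    return [s + gain * n for s, n in zip(speech, noise)]

def measure_snr_db(speech, noise):
    """SNR in dB between a speech signal and an (already scaled) noise signal."""
    return 10 * math.log10(mean_power(speech) / mean_power(noise))
```

Under this definition, 0 dB means the noise carries the same average power as the speech, and 5 dB means the speech power is about 3.2 times the noise power, which is consistent with all products being more intelligible at 5 dB.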