Illustration of the (A) classical and (B) proposed approaches to formant tracking. Key advantages to the proposed KARMA approach include intra-frame observation of autoregressive moving average parameters for both formant and antiformant tracking, inter-frame tracking using linearized Kalman inference, and the availability of both point estimates and uncertainties for each trajectory.
Comparison of extended Kalman filter (solid) and particle filter (dashed) tracking performance in terms of root-mean-square error (RMSE) averaged over 25 Monte Carlo trials and reported with 95% confidence intervals (gray).
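The error summary described in this caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the per-trial values are made up, the helper names `rmse` and `mean_with_ci95` are hypothetical, and the interval uses a normal approximation (1.96 standard errors), which is an assumption because the caption does not say how the 95% intervals were computed.

```python
import math

def rmse(estimates, references):
    """Root-mean-square error between two equal-length sequences (in Hz)."""
    if len(estimates) != len(references):
        raise ValueError("sequences must have the same length")
    squared = [(e - r) ** 2 for e, r in zip(estimates, references)]
    return math.sqrt(sum(squared) / len(squared))

def mean_with_ci95(values):
    """Mean and half-width of a normal-approximation 95% confidence interval.

    Assumption: 1.96 standard errors; a t-quantile would be slightly wider
    for a small number of trials.
    """
    n = len(values)
    mean = sum(values) / n
    sample_var = sum((v - mean) ** 2 for v in values) / (n - 1)
    half_width = 1.96 * math.sqrt(sample_var / n)
    return mean, half_width

# Hypothetical data: one RMSE value per Monte Carlo trial (two trials here).
trial_rmse = [
    rmse([500.0, 510.0], [490.0, 520.0]),
    rmse([505.0, 495.0], [500.0, 500.0]),
]
mean_rmse, ci_half_width = mean_with_ci95(trial_rmse)
print(f"RMSE = {mean_rmse:.1f} Hz +/- {ci_half_width:.1f} Hz")
```

With 25 trials, as in the caption, the same two functions apply unchanged to a list of 25 per-trial RMSE values.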
Estimated formant tracks on spectrogram of VTR utterance 19 by an adult female from New England: “Nice country to meet a lion in face to face.” Reference trajectories from the VTR database are shown (red, dashed) along with the formant frequency tracks (blue, solid) from (A) KARMA, (B) wavesurfer, and (C) praat. Overall root-mean-square error (RMSE) is reported across all formants and frames labeled as speech, in addition to separate RMSE values for f1, f2, and f3. The KARMA output additionally displays uncertainty (gray shading, ±1 standard deviation) for each formant trajectory. Frames are categorized using TIMIT labels of phonetic class: vowel (blue), semivowel/glide (green), nasal (cyan), fricative (magenta), affricate (red), stop (black).
Effect of bandwidth tracking and state covariance matrix Q on KARMA formant tracking. VTR utterance 10 by an adult female from New England: “Reading in poor light gives you eyestrain.” (A) Bandwidths fixed to baseline values in Table III with diagonal elements of Q equal to (224 Hz)², (B) bandwidth values tracked with Q as in (A), and (C) bandwidth values tracked with diagonal elements of Q increased to (949 Hz)². Overall root-mean-square error (RMSE) is reported across all formants and frames labeled as speech, in addition to separate RMSE values for f1, f2, and f3. Color coding as in Fig. 3.
Illustration of the output from KARMA for the synthesized utterance /nɑn/. (A) True trajectories (red, dashed) are shown with the mean estimates (solid blue for formants, solid green for antiformants) and uncertainties (gray shading) for each frequency and bandwidth. (B) An alternative display is shown with a wideband spectrogram along with estimated frequency and bandwidth tracks of formants (blue) and antiformants (green). The 3-dB bandwidths dictate the width of the corresponding frequency tracks.
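To make the frequency and 3-dB bandwidth pairs in this caption concrete, the sketch below converts one pair to a complex pole of a discrete-time resonator and back. It uses the standard narrow-bandwidth relation (radius exp(-πb/fs), angle 2πf/fs). The function names and the 16 kHz sampling rate are assumptions for illustration, and the caption does not confirm that KARMA parameterizes its resonances in exactly this way. An antiformant would be treated the same way, with a zero in place of a pole.

```python
import cmath
import math

def pole_from_formant(freq_hz, bw_hz, fs_hz):
    """Complex pole for a resonance with center frequency and 3-dB bandwidth."""
    radius = math.exp(-math.pi * bw_hz / fs_hz)   # wider bandwidth -> smaller radius
    angle = 2.0 * math.pi * freq_hz / fs_hz       # center frequency -> pole angle
    return cmath.rect(radius, angle)

def formant_from_pole(pole, fs_hz):
    """Invert pole_from_formant: recover (frequency, bandwidth) in Hz."""
    freq_hz = cmath.phase(pole) * fs_hz / (2.0 * math.pi)
    bw_hz = -math.log(abs(pole)) * fs_hz / math.pi
    return freq_hz, bw_hz

FS_HZ = 16000.0  # assumed sampling rate, for illustration only
pole = pole_from_formant(500.0, 80.0, FS_HZ)
freq_back, bw_back = formant_from_pole(pole, FS_HZ)
print(abs(pole), freq_back, bw_back)
```

The pole lies inside the unit circle for any positive bandwidth, so the corresponding resonator is stable.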
KARMA output for three spoken nasal consonants: (A) /m/, (B) /n/, and (C) /ŋ/. On the left, spectrograms overlay the mean estimates (blue for formants, green for antiformants) and uncertainties (gray shading) for each frequency and bandwidth. Plots to the right display the corresponding periodogram (gray) and spectral ARMA model fit (black).
KARMA formant and antiformant tracks of an utterance by an adult male: “piano.” Displayed are (A) the wideband spectrogram of the speech waveform and (B) the spectrogram overlaid with formant frequency estimates (blue), antiformant frequency estimates (green), and uncertainties (±1 standard deviation) for each track (gray). Arrows indicate the beginning and ending of the utterance. Note the increase in uncertainty during silence regions.
Kalman-based formant tracks using the (A) parametric ARMA cepstrum and (B) nonparametric real cepstrum as observations. The VTRsynthf0 waveform is a synthesized version of VTR utterance 1: “Even then, if she took one step forward, he could catch her.” Color coding as in Fig. 3.
The extended Kalman algorithms for yielding point estimates and associated uncertainties of tracked parameters. See text for definition of variables.
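As a rough illustration of how a Kalman recursion yields both a point estimate and an uncertainty at each frame, the sketch below runs a scalar linear Kalman filter with a random-walk state model. This is a deliberate simplification and not the extended Kalman algorithm of the table: there is no linearization step, the state is a single frequency, and every numeric value and name (`kalman_step`, `process_var`, `obs_var`) is hypothetical.

```python
import math

def kalman_step(mean, var, observation, process_var, obs_var):
    """One predict/update cycle of a scalar Kalman filter.

    Assumed model (illustrative only): the state follows a random walk and
    is observed directly with additive Gaussian noise.
    """
    # Predict: the mean is carried over, the uncertainty grows.
    pred_mean = mean
    pred_var = var + process_var
    # Update: blend prediction and observation according to the Kalman gain.
    gain = pred_var / (pred_var + obs_var)
    new_mean = pred_mean + gain * (observation - pred_mean)
    new_var = (1.0 - gain) * pred_var
    return new_mean, new_var

# Hypothetical per-frame observations of one formant frequency (Hz).
observations = [520.0, 515.0, 530.0]
mean, var = 500.0, 100.0 ** 2  # initial estimate and variance (assumed)
for y in observations:
    mean, var = kalman_step(mean, var, y, process_var=30.0 ** 2, obs_var=50.0 ** 2)
    print(f"estimate = {mean:.1f} Hz, std = {math.sqrt(var):.1f} Hz")
```

The square root of the posterior variance plays the role of a ±1 standard deviation band around the tracked value.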
Proposed KARMA algorithm for formant and antiformant tracking.
Modifiable parameters and their baseline values for the three steps in the proposed KARMA approach.
Formant tracking performance of KARMA, wavesurfer, and praat in terms of root-mean-square error (RMSE) taken per formant across all 516 utterances in the VTR database (Deng et al., 2006b). Reported RMSE (in Hz) is computed over speech-labeled frames and further categorized by 6 phonetic classes.
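The per-class breakdown described in this caption can be sketched as below: errors are pooled only over frames whose label belongs to one of the speech classes, then summarized per class and overall. The function name, the label strings, and the toy data are hypothetical; only the six class names are taken from the phonetic classes listed for Fig. 3.

```python
import math
from collections import defaultdict

SPEECH_CLASSES = {"vowel", "semivowel/glide", "nasal", "fricative", "affricate", "stop"}

def rmse_by_class(estimates, references, labels, speech_classes=SPEECH_CLASSES):
    """RMSE (Hz) per phonetic class and over all speech-labeled frames."""
    squared_errors = defaultdict(list)
    for est, ref, label in zip(estimates, references, labels):
        if label not in speech_classes:
            continue  # non-speech frames (e.g. silence) are excluded
        err = (est - ref) ** 2
        squared_errors[label].append(err)
        squared_errors["all"].append(err)
    return {k: math.sqrt(sum(v) / len(v)) for k, v in squared_errors.items()}

# Hypothetical frame-level data for a single formant track.
estimates = [500.0, 520.0, 1000.0, 700.0]
references = [510.0, 500.0, 0.0, 700.0]
labels = ["vowel", "vowel", "silence", "nasal"]
result = rmse_by_class(estimates, references, labels)
print(result)
```

Running the function once per formant and per tracker would give one row of such a table.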
Formant tracking performance of KARMA, wavesurfer, and praat in terms of root-mean-square error (RMSE) taken per formant across all 516 utterances in the VTR database (Deng et al., 2006b). RMSE (in Hz) is reported over speech-labeled frames and further categorized by speaker gender (male, female).
RMSE (in Hz) of KARMA, wavesurfer, and praat formant tracking of the first three formant trajectories in the VTRsynth database that resynthesizes utterances using a stochastic source.
RMSE of KARMA, wavesurfer, and praat formant tracking of the first three formant trajectories in the VTRsynthf0 database that resynthesizes VTR database utterances using stochastic and periodic sources. RMSE (in Hz) is reported over speech-labeled frames and further categorized by original speaker gender (male, female) to reveal any fundamental frequency effects.