Processing speech signal using auditory-like filterbank provides least uncertainty about articulatory gestures
(Color online) Articulator data from the X-ray MicroBeam speech production database (Ref. 13). (a) Pellets are placed on eight critical points of the articulators viewed on the midsagittal plane of the speaker, namely, upper lip (UL), lower lip (LL), tongue tip (T1), tongue body (T2 and T3), tongue dorsum (T4), mandibular incisors (MNI), and mandibular molars (MNM). The XY coordinate values of these pellets are recorded. (b) The X and Y coordinate values of eight pellets are shown when a male speaker was uttering “but special.”
Parametric representation of filterbanks with different center frequency and bandwidth relationships. (a) The red dashed curve is a warping function which maps the linear frequency axis ω to a warped frequency axis ω_α such that a uniform filterbank on the ω_α axis is equivalent to a nonuniform filterbank on the ω axis, which has a center frequency and bandwidth relationship identical to that of a cochlear filterbank. (b) The warping function is parameterized by the parameter α. Different warping functions are shown for different choices of α (blue curves). The red dashed curve shows the warping function corresponding to the cochlear filterbank and is identical to the red dashed curve in (a).
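The relationship between α and the filterbank layout can be illustrated with a small sketch. The caption does not give the formula for the warping function, so the code below assumes a first-order all-pass (bilinear) warping, a common choice for one-parameter frequency warping. The sign convention is also an assumption, chosen so that negative α gives finer resolution at low frequencies, in line with the captions. The names `warp` and `center_frequencies` are hypothetical.

```python
import math

def warp(omega, alpha):
    """Map a linear frequency omega in [0, pi] to the warped axis.

    ASSUMPTION: first-order all-pass warping; the caption does not state
    the formula. Sign convention (also assumed): negative alpha stretches
    the low-frequency region of the axis.
    """
    return omega - 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 + alpha * math.cos(omega))

def center_frequencies(n_filters, alpha):
    """Centers on the linear axis of a bank that is uniform on the warped axis."""
    step = math.pi / (n_filters + 1)  # uniform spacing on the warped axis
    # For this family of maps, the inverse warping is the same function
    # with the sign of alpha flipped.
    return [warp(k * step, -alpha) for k in range(1, n_filters + 1)]

# alpha = 0 leaves the axis unchanged (uniform filterbank); under the
# assumed convention, a negative alpha packs the centers toward low frequencies.
uniform = center_frequencies(8, 0.0)
cochlea_like = center_frequencies(8, -0.6)
```

Under this assumed convention, the spacing between adjacent centers in `cochlea_like` grows with frequency, which is the qualitative behavior the caption attributes to a cochlear filterbank.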
The MI between the acoustic feature and the speech articulation for different filterbanks for an English male speaker [(a)–(b)] and an English female speaker [(c)–(d)]. (a) One hundred different filterbanks were generated by randomly selecting α between −0.95 and 0.95. MI was obtained by averaging 20 different estimates of MI for each choice of α. The SD of MI over these 20 estimates is of the order of 10^−3. The maximum MI, 0.2637, occurs for α = −0.5752. (b) The warping function and the filterbank are plotted for α = −0.5752. The red dashed curve shows the warping function corresponding to the cochlear filterbank. MI computed using the cochlear filterbank is 0.2630. (c) and (d) Same as (a) and (b), repeated for the female speaker. The maximum MI, 0.4576, occurs for α = −0.5728. MI computed using the cochlear filterbank is 0.4540.
(Color online) The MI between the acoustic feature and the speech articulation for different filterbanks for Cantonese [(a)–(c)] and Georgian [(d)–(e)] subjects. One hundred different filterbanks (i.e., different α on the X-axis of each plot) were generated by randomly selecting α between −0.95 and 0.95. MI was obtained by averaging 20 different estimates of MI for each choice of α. The SD of MI over these 20 estimates is of the order of 10^−3. Maximum MIs in (a)–(e) occur for α = −0.5728, −0.5196, −0.6308, −0.5752, and −0.5752, respectively. Note that α = −0.6 corresponds to the empirically established cochlear filterbank. This suggests that the empirically established cochlear filterbank is a near-optimal filterbank, whose output provides close to the maximum information about the articulatory gestures.
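The sweep procedure described in the captions above (100 random values of α in [−0.95, 0.95], 20 MI estimates averaged per value) can be sketched as follows. The captions do not say how MI is estimated, so `estimate_mi` is a hypothetical user-supplied callable, and the toy estimator in the usage lines is purely illustrative and does not reproduce any reported value.

```python
import random
import statistics

def sweep_alpha(estimate_mi, n_filterbanks=100, n_estimates=20,
                lo=-0.95, hi=0.95, seed=0):
    """Return (results, best), where each result is (alpha, mean MI, SD of MI).

    estimate_mi(alpha, rng) is a HYPOTHETICAL callable returning one MI
    estimate for the filterbank defined by alpha.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_filterbanks):
        alpha = rng.uniform(lo, hi)  # one randomly chosen filterbank
        estimates = [estimate_mi(alpha, rng) for _ in range(n_estimates)]
        results.append((alpha,
                        statistics.mean(estimates),
                        statistics.stdev(estimates)))
    results.sort()  # order by alpha, as on the X-axis of the plots
    best = max(results, key=lambda r: r[1])  # filterbank with the largest mean MI
    return results, best

# Toy stand-in for an MI estimator (illustrative only): a curve peaking
# at alpha = -0.6 plus noise with SD 1e-3.
def toy_mi(alpha, rng):
    return 0.3 - (alpha + 0.6) ** 2 + rng.gauss(0.0, 1e-3)

results, best = sweep_alpha(toy_mi)
```

Averaging repeated estimates before taking the maximum keeps a single noisy estimate from deciding which α is reported as optimal.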
The MI between the acoustic feature and the speech articulation across different speakers. Along the axis labeled “speaker number,” the first 40 points correspond to the English subjects, followed by three Cantonese subjects and then two Georgian subjects. (a) Speaker-specific range of α over which the MI is more than 90% of the maximum MI. All ranges are on the negative side, indicating that filterbanks with cochlea-like nonuniform frequency resolution achieve near-maximum MI. Blue dots indicate the filterbank corresponding to maximum MI. (b) Percentage of the one hundred randomly chosen filterbanks (FBs) that yield less MI than that obtained by the cochlear filterbank. The dashed line corresponds to an average percentage of 92.20%.
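The two per-speaker summaries in this caption are simple to compute once the sweep results are available. The sketch below assumes the sampled α values and their averaged MI values are held in two parallel lists; the function names and the numbers in the usage lines are hypothetical and are not taken from the study.

```python
def near_max_range(alphas, mis, fraction=0.9):
    """Smallest and largest sampled alpha whose MI exceeds fraction * max MI.

    Assumes MI values are positive and that the qualifying alphas form one
    contiguous stretch, so that (min, max) describes the range.
    """
    threshold = fraction * max(mis)
    kept = [a for a, m in zip(alphas, mis) if m > threshold]
    return min(kept), max(kept)

def percent_below(mis, reference_mi):
    """Percentage of filterbanks whose MI is lower than a reference MI."""
    return 100.0 * sum(m < reference_mi for m in mis) / len(mis)

# Made-up illustrative values for one speaker.
alphas = [-0.9, -0.6, -0.3, 0.0, 0.3]
mis = [0.20, 0.26, 0.25, 0.15, 0.10]
alpha_range = near_max_range(alphas, mis)
share = percent_below(mis, reference_mi=0.255)
```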
Genders of the subjects corresponding to two different ranges of optimal α (α > −0.7 and α < −0.7).