Processing speech signal using auditory-like filterbank provides least uncertainty about articulatory gestures
10.1121/1.3573987

Figures

FIG. 1.

(Color online) Articulator data from the X-ray MicroBeam speech production database (Ref. 13). (a) Pellets are placed on eight critical points of the articulators viewed on the midsagittal plane of the speaker, namely, upper lip (UL), lower lip (LL), tongue tip (T1), tongue body (T2 and T3), tongue dorsum (T4), mandibular incisors (MNI), and mandibular molars (MNM). The XY coordinate values of these pellets are recorded. (b) The X and Y coordinate values of eight pellets are shown when a male speaker was uttering “but special.”

FIG. 2.

Parametric representation of filterbanks with different center frequency and bandwidth relationships. (a) The red dashed curve is a warping function which maps the linear frequency axis ω to a warped frequency axis ω_α such that a uniform filterbank on the ω_α axis is equivalent to a nonuniform filterbank on the ω axis, which has a center frequency and bandwidth relationship identical to that of a cochlear filterbank. (b) The warping function is parameterized by α. Different warping functions are shown for different choices of α (blue curves). The red dashed curve shows the warping function corresponding to the cochlear filterbank and is identical to the red dashed curve in (a).
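As an illustration of how a single parameter α can turn a uniform filterbank into a nonuniform one, the sketch below uses a first-order all-pass (bilinear) frequency warping. This is an assumption: the caption does not give the article's warping function, and the sign convention here is chosen only so that negative α yields finer resolution at low frequencies. The names `warp` and `center_frequencies` are hypothetical.

```python
import math

def warp(omega, alpha):
    """Map a linear frequency omega in [0, pi] to a warped frequency.

    Assumed form: the phase of a first-order all-pass filter, with the sign
    chosen so that negative alpha stretches the low-frequency region.
    This is not necessarily the warping function used in the article.
    """
    num = alpha * math.sin(omega)
    den = 1.0 + alpha * math.cos(omega)  # positive whenever |alpha| < 1
    return omega - 2.0 * math.atan2(num, den)

def center_frequencies(n_filters, alpha):
    """Center frequencies on the linear axis of a filterbank that is
    uniform on the warped axis.

    For the all-pass warping, the inverse of warp(., alpha) is warp(., -alpha).
    """
    uniform = [math.pi * (k + 0.5) / n_filters for k in range(n_filters)]
    return [warp(w, -alpha) for w in uniform]

# Example: eight filters with alpha = -0.6 crowd toward low frequencies.
centers = center_frequencies(8, -0.6)
```

With α = 0 the warping is the identity and the filterbank stays uniform; under this assumed convention, moving α toward −1 packs progressively more filters into the low-frequency region.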

FIG. 3.

The MI between the acoustic feature and the speech articulation for different filterbanks, for an English male speaker [(a)–(b)] and an English female speaker [(c)–(d)]. (a) One hundred different filterbanks were generated by randomly selecting α between −0.95 and 0.95. For each choice of α, the MI was obtained by averaging 20 different estimates. The SD of MI over these 20 estimates is of the order of 10⁻³. The maximum MI, 0.2637, occurs for α = −0.5752. (b) The warping function and the filterbank are plotted for α = −0.5752. The red dashed curve shows the warping function corresponding to the cochlear filterbank. The MI computed using the cochlear filterbank is 0.2630. (c) and (d) Same as (a) and (b), for the female speaker. The maximum MI, 0.4576, occurs for α = −0.5728. The MI computed using the cochlear filterbank is 0.4540.
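The sweep described in this caption (100 random values of α, 20 MI estimates averaged per value, then the maximum located) can be sketched as follows. The caption does not say how α was sampled or how MI was estimated, so uniform sampling is an assumption, and `toy_estimator` is a hypothetical stand-in with a peak placed arbitrarily near α = −0.6; it is not the article's estimator and its numbers are not results.

```python
import random
import statistics

def sweep_alpha(estimate_mi, n_filterbanks=100, n_estimates=20,
                lo=-0.95, hi=0.95, seed=0):
    """Return (results, best), where results is a list of
    (alpha, mean MI, SD of MI) and best is the entry with the largest mean."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_filterbanks):
        alpha = rng.uniform(lo, hi)  # assumed: uniform sampling of alpha
        estimates = [estimate_mi(alpha, rng) for _ in range(n_estimates)]
        results.append((alpha,
                        statistics.mean(estimates),
                        statistics.stdev(estimates)))
    best = max(results, key=lambda r: r[1])
    return results, best

def toy_estimator(alpha, rng):
    """Hypothetical noisy MI estimate, for demonstration only."""
    return 0.26 - 0.1 * (alpha + 0.6) ** 2 + rng.gauss(0.0, 1e-3)

results, best = sweep_alpha(toy_estimator)
```

Passing the estimator as a callable keeps the sweep independent of whichever MI estimation method is actually used.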

FIG. 4.

(Color online) The MI between the acoustic feature and the speech articulation for different filterbanks for Cantonese [(a)–(c)] and Georgian [(d)–(e)] subjects. One hundred different filterbanks (i.e., different α on the X-axis of each plot) were generated by randomly selecting α between −0.95 and 0.95. For each choice of α, the MI was obtained by averaging 20 different estimates. The SD of MI over these 20 estimates is of the order of 10⁻³. The maximum MIs in (a)–(e) occur for α = −0.5728, −0.5196, −0.6308, −0.5752, and −0.5752, respectively. Note that α = −0.6 corresponds to the empirically established cochlear filterbank. This suggests that the empirically established cochlear filterbank is near-optimal: its output provides close to the maximum information about the articulatory gestures.

FIG. 5.

The MI between the acoustic feature and the speech articulation across different speakers. Along the axis labeled “speaker number,” the first 40 points correspond to the English subjects, followed by three Cantonese subjects and then two Georgian subjects. (a) Speaker-specific range of α over which the MI is more than 90% of the maximum MI. All ranges lie on the negative side, indicating that filterbanks with cochlea-like nonuniform frequency resolution achieve near-maximum MI. Blue dots indicate the filterbank corresponding to the maximum MI. (b) Percentage of the one hundred randomly chosen filterbanks (FBs) that yield less MI than the cochlear filterbank. The dashed line corresponds to the average percentage of 92.20%.
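The two per-speaker summaries in this caption can be computed from a list of (α, MI) pairs roughly as follows. This is a minimal sketch: the function names are hypothetical, the input values are made up for illustration, and reporting the range as the minimum and maximum α above the threshold assumes those values form one contiguous interval.

```python
def near_max_range(alphas, mis, fraction=0.9):
    """Smallest and largest alpha whose MI exceeds `fraction` of the maximum MI."""
    threshold = fraction * max(mis)
    kept = [a for a, m in zip(alphas, mis) if m > threshold]
    return min(kept), max(kept)

def percent_below(mis, reference_mi):
    """Percentage of filterbanks whose MI is below a reference MI,
    e.g. the MI obtained with the cochlear filterbank."""
    return 100.0 * sum(m < reference_mi for m in mis) / len(mis)

# Made-up illustrative values, not data from the article.
alphas = [-0.9, -0.6, -0.3, 0.0, 0.5]
mis = [0.20, 0.26, 0.25, 0.22, 0.15]
print(near_max_range(alphas, mis))   # (-0.6, -0.3)
print(percent_below(mis, 0.255))     # 80.0
```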

Tables

TABLE I.

Genders of the subjects corresponding to two different ranges of optimal α (α > −0.7 and α < −0.7).

2011-06-14
2014-04-24