Spatio-temporal articulatory movement primitives during speech production: Extraction, interpretation, and validation
FIG. 1.

(Color online) Vocal tract articulators (marked on a midsagittal image of the vocal tract).

FIG. 2.

Gestural score for the word “team.” Each gray block corresponds to a vocal tract action or gesture. See Fig. 1 for an illustration of the constricting organs. Also notice that at any given instant in time, only a few gestures are “on” or “active,” i.e., the activation of the gestural score is sparse in time.

FIG. 3.

(Color online) Flow diagram of TaDA, as depicted in .

FIG. 4.

(Color online) A screenshot of the Task Dynamics Application (or TaDA) software GUI (after ). Displayed to the left is the instantaneous vocal tract shape and area function at the time marked by the cursor in the temporal display. Note especially the pellets corresponding to different pseudo vocal-tract flesh-points in the top left display, movements of which (displayed in color in the bottom center panels) are used for our experiments. The center panels just above these consist of two overlaid waveforms. There is one panel for each constriction task/goal variable of interest. The square waveforms depict activations of theoretical gestures associated with that task (input to the model), while the continuous waveforms depict the actual waveforms of those task variables obtained as output from the TaDA model.

FIG. 5.

(Color online) Schematic illustrating the proposed cNMFsc algorithm. The input matrix can be constructed either from real (EMA) or synthesized (TaDA) articulatory data. In this example, we assume that there are 7 articulator fleshpoint trajectories. We would like to find 5 basis functions or articulatory primitives, collectively depicted as the big red cuboid (representing a three-dimensional matrix). Each vertical slab of the cuboid is one primitive (numbered 1 to 5). For instance, the white tube represents a single component of the third primitive that corresponds to the first articulator. The activation of each of these five time-varying primitives/basis functions is given by the rows of the activation matrix in the bottom right-hand corner. For instance, the five values in a given column of the activation matrix are the weights that multiply each of the five primitives at the corresponding time sample.
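The sparseness constraint in cNMFsc-style algorithms is typically Hoyer's measure, which scores a nonnegative vector between 0 (all entries equal) and 1 (a single nonzero entry). A minimal sketch of that measure (the function name is illustrative, not from the original implementation):

```python
import numpy as np

def hoyer_sparseness(x):
    """Hoyer sparseness of a vector: 1 if exactly one entry is
    nonzero, 0 if all entries are equal, values in between otherwise."""
    x = np.asarray(x, dtype=float)
    n = x.size
    l1 = np.abs(x).sum()
    l2 = np.sqrt((x ** 2).sum())
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)
```

Constraining each row of the activation matrix to a target value of this measure is what forces only a few primitives to be active at any time.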

FIG. 6.

(Color online) Schematic illustrating how shifted and scaled primitives can additively reconstruct the original input data sequence. Each gold square in the topmost row represents one column vector of the input data matrix, corresponding to a single sampling instant in time. Recall that our basis functions/primitives are time-varying. Hence, at any given time instant, we plot only the basis functions/primitives that have non-zero activation (i.e., the corresponding rows of the activation matrix at that time have non-zero entries). Notice that any given basis function extends 4 samples in time, represented by a sequence of four silver/gray squares each. Thus, in order to reconstruct, say, the fourth column of the input matrix, we need to consider the contributions of all basis functions that are “active” starting anywhere between time instants 1 and 4, as shown.
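The additive reconstruction described above is the standard convolutive NMF synthesis: each primitive is convolved with its activation row and the results are summed. A minimal numpy sketch under assumed array shapes (names and shapes are illustrative):

```python
import numpy as np

def conv_reconstruct(W, H):
    """Reconstruct the data matrix V (m x N) from K time-varying primitives.
    W: (T, m, K) -- K primitives, each spanning m articulators and T samples.
    H: (K, N)    -- activation of each primitive at each sampling instant."""
    T, m, K = W.shape
    _, N = H.shape
    V = np.zeros((m, N))
    for t in range(T):
        # shift the activations right by t samples (zero-padded),
        # so a primitive activated at time s contributes at s, s+1, ..., s+T-1
        H_t = np.zeros_like(H)
        H_t[:, t:] = H[:, :N - t]
        V += W[t] @ H_t
    return V
```

This makes explicit why reconstructing one column requires every primitive activated up to T − 1 samples earlier.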

FIG. 7.

(Color online) Akaike Information Criterion (AIC) values for different values of the number of bases and the temporal extent of each basis, computed for one speaker. We observe that optimal model selection prefers the parameter values to be as low as possible, since the number of parameters in the model far exceeds the contribution of the log-likelihood term in computing the AIC.
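One common way to obtain the AIC for such a reconstruction model is to assume i.i.d. Gaussian residuals, so the log-likelihood is determined by the residual variance; the parameter count grows with both the number of bases and their temporal extent, which is why AIC favors small values of both. A hedged sketch of that computation (this is one standard formulation, not necessarily the exact one used in the article):

```python
import numpy as np

def aic_gaussian(V, V_hat, n_params):
    """AIC = 2*k - 2*ln(L) under an i.i.d. Gaussian residual assumption,
    with the residual variance set to its maximum-likelihood estimate."""
    resid = (np.asarray(V) - np.asarray(V_hat)).ravel()
    n = resid.size
    sigma2 = np.mean(resid ** 2)  # MLE of the residual variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return 2 * n_params - 2 * log_lik
```

With a fixed reconstruction error, AIC then increases linearly in the parameter count, matching the behavior described in the caption.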

FIG. 8.

(Color online) Histogram of the number of non-zero constriction task variables at any sampling instant.

FIG. 9.

(Color online) RMSE for each articulator and phone class (categorized by ARPABET symbol) obtained as a result of running the algorithm on all 460 sentences spoken by the male speaker.

FIG. 10.

(Color online) RMSE for each articulator and phone class (categorized by ARPABET symbol) obtained as a result of running the algorithm on all 460 sentences spoken by the female speaker.

FIG. 11.

(Color online) Histograms of the fraction of variance unexplained (FVU) by the proposed cNMFsc model for the two MOCHA-TIMIT speakers, shown in panels (a) and (b). The samples of each distribution were obtained by computing the FVU for each of the 460 sentences per speaker. (The algorithm parameters used in the model were a sparseness of 0.65, 8 bases, and a temporal extent of 10 samples.)
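The fraction of variance unexplained compares the squared reconstruction error against the total variance of the data; a minimal sketch of the conventional definition (function name illustrative):

```python
import numpy as np

def fvu(V, V_hat):
    """Fraction of variance unexplained:
    ||V - V_hat||^2 / ||V - mean(V)||^2.
    0 for a perfect reconstruction; 1 when the reconstruction
    does no better than predicting the mean."""
    V = np.asarray(V, dtype=float)
    V_hat = np.asarray(V_hat, dtype=float)
    resid = ((V - V_hat) ** 2).sum()
    total = ((V - V.mean()) ** 2).sum()
    return resid / total
```

Computing this once per sentence yields the per-speaker histograms shown in the figure.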

FIG. 12.

(Color online) Original (dashed) and cNMFsc-estimated (solid) articulator trajectories of selected (left) TaDA articulator variables and (right) EMA (MOCHA-TIMIT) articulator variables (obtained from one speaker) for the sentence “this was easy for us.” The vertical axis in each subplot depicts the value of the articulator variable scaled by its range (to the interval [0,1]), while the horizontal axis shows the sample index in time (sampling rate = 100 Hz). The algorithm parameters used were a sparseness of 0.65, 8 bases, and a temporal extent of 10 samples. See Table I for an explanation of articulator symbols.

FIG. 13.

(Color online) Spatio-temporal basis functions or primitives extracted from MOCHA-TIMIT data from one speaker. The algorithm parameters used were a sparseness of 0.65, 8 bases, and a temporal extent of 10 samples. The front of the mouth is located toward the left-hand side of each image (and the back of the mouth on the right). Each articulator trajectory is represented as a curve traced out by ten colored markers (one for each time step), starting from a lighter color and ending in a darker color. The marker used for each trajectory is shown in the legend.

FIG. 14.

(Color online) Average activation pattern of the 8 basis functions or primitives for (a) voiceless stop consonants and (b) British English vowels, obtained from one speaker's data. For each phone category, 8 colorbars are plotted, one corresponding to the average activation of each of the 8 primitives. This was obtained by collecting all columns of the activation matrix corresponding to each phone interval (as well as a fixed number of columns before and after it) and taking the average across each of the 8 rows.


TABLE I.

Articulator flesh point variables that comprise the post-processed synthetic (TaDA) and real (EMA) datasets that we use for our experiments. Note that the EMA dataset does not have a Tongue Root sensor, but has an extra maxillary (upper incisor) sensor in addition to the mandibular (jaw) sensor. Also, TaDA does not output explicit spatial coordinates of the velum. Instead it has a single velic opening parameter that controls the degree to which the velopharyngeal port is open. Since this parameter is not a spatial (x, y) coordinate like the other variables considered, we chose to omit this parameter from the analysis described in this paper.

TABLE II.

Top five canonical correlation values between the gestural activation matrix (generated by TaDA) and the estimated activation matrix for both TaDA and EMA cases.
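Canonical correlation analysis between the two activation matrices can be computed as the singular values of the product of orthonormal bases for the two (centered) column spaces. A minimal numpy sketch of that standard computation (names illustrative; rows are samples, columns are variables):

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between the columns of X (n x p) and
    Y (n x q): the singular values of Qx.T @ Qy, where Qx and Qy are
    orthonormal bases for the centered column spaces of X and Y."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)  # guard tiny numerical overshoot
```

Sorting the resulting values in descending order and keeping the first five gives correlations of the kind reported in the table.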

