^{1,a)}, Louis Goldstein,^{2} and Shrikanth S. Narayanan^{3}

### Abstract

This paper presents a computational approach to deriving interpretable movement primitives from speech articulation data. It puts forth a convolutive Nonnegative Matrix Factorization algorithm with sparseness constraints (cNMFsc) that decomposes a given data matrix into a set of spatiotemporal basis sequences and an activation matrix. The algorithm optimizes a cost function that trades off the mismatch between the proposed model and the input data against the number of primitives that are active at any given instant. The method is applied both to measured articulatory data obtained through electromagnetic articulography and to synthetic data generated using an articulatory synthesizer. The paper then describes how to evaluate the algorithm's performance quantitatively and further performs a qualitative assessment of its ability to recover compositional structure from data. This is done using pseudo ground-truth primitives generated by the articulatory synthesizer based on an Articulatory Phonology framework [Browman and Goldstein (1995). “Dynamics and articulatory phonology,” in Mind as Motion: Explorations in the Dynamics of Cognition, edited by R. F. Port and T. van Gelder (MIT Press, Cambridge, MA), pp. 175–194]. The results suggest that the proposed algorithm extracts movement primitives from human speech production data that are linguistically interpretable. Such a framework might aid the understanding of longstanding issues in speech production such as motor control and coarticulation.
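
The decomposition described in the abstract can be sketched numerically. The snippet below is a minimal illustration, not the authors' implementation: it evaluates the convolutive model (each of K bases spans T samples and is shifted over the activation matrix) together with Hoyer's sparseness measure, on which sparseness constraints of this kind are commonly based. All array names and shapes are assumptions chosen for illustration.

```python
import numpy as np

def cnmf_reconstruct(W, H):
    """Convolutive NMF model: V_hat = sum_t W[:, :, t] @ shift(H, t).

    W : (M, K, T) array of K spatiotemporal bases, each spanning
        M articulator channels and T time samples.
    H : (K, N) nonnegative activation matrix; shifting H right by t
        lets a basis activated at sample n contribute to columns
        n .. n + T - 1 of the reconstruction.
    """
    M, K, T = W.shape
    _, N = H.shape
    V_hat = np.zeros((M, N))
    for t in range(T):
        H_shift = np.zeros_like(H)
        H_shift[:, t:] = H[:, :N - t]  # activations shifted right by t samples
        V_hat += W[:, :, t] @ H_shift
    return V_hat

def hoyer_sparseness(h):
    """Sparseness of a vector on [0, 1]: 1 for a single spike,
    0 for a constant vector (Hoyer-style L1/L2 ratio)."""
    n = h.size
    l1 = np.abs(h).sum()
    l2 = np.sqrt((h ** 2).sum())
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)
```

The cost function described above then balances the reconstruction error ||V − V_hat|| against keeping the rows of H at a target sparseness level.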

This work was supported by NIH Grant R01 DC007124-01. The authors thank Prasanta Ghosh for help with the EMA data processing.

I. MOVEMENT PRIMITIVES AND MOTOR CONTROL

A. Notation

II. REVIEW OF DATA-DRIVEN METHODS TO EXTRACT MOVEMENT PRIMITIVES

III. VALIDATION STRATEGY

IV. DATA

V. PROBLEM FORMULATION

A. Nonnegative Matrix Factorization and its extensions

B. Extraction of primitive representations from data

C. Selection of optimization free parameters

VI. RESULTS AND VALIDATION

A. Quantitative performance metrics

B. Qualitative comparisons of TaDA model predictions with the proposed algorithm

1. Comparison with gestural scores

2. Significance of extracted synergies

C. Visualization of extracted basis functions

VII. DISCUSSION AND FUTURE WORK

VIII. CONCLUSIONS

### Key Topics

- Speech (35.0)
- Speech production models (26.0)
- Trajectory models (18.0)
- Phonetic segments (14.0)
- Speech production (13.0)

##### G10L

## Figures

(Color online) Vocal tract articulators (marked on a midsagittal image of the vocal tract).

Gestural score for the word “team.” Each gray block corresponds to a vocal tract action or gesture. See Fig. 1 for an illustration of the constricting organs. Also notice that at any given instant in time, only a few gestures are “on” or “active,” i.e., the activation of the gestural score is sparse in time.

(Color online) Flow diagram of TaDA, as depicted in Nam et al. (2012).

(Color online) A screenshot of the Task Dynamics Application (or TaDA) software GUI (after Nam et al., 2006). Displayed to the left is the instantaneous vocal tract shape and area function at the time marked by the cursor in the temporal display. Note especially the pellets corresponding to different pseudo vocal-tract flesh-points in the top left display, movements of which (displayed in color in the bottom center panels) are used for our experiments. The center panels just above these consist of two overlaid waveforms. There is one panel for each constriction task/goal variable of interest. The square waveforms depict activations of theoretical gestures associated with that task (input to the model), while the continuous waveforms depict the actual waveforms of those task variables obtained as output from the TaDA model.

(Color online) Schematic illustrating the proposed cNMFsc algorithm. The input matrix V can be constructed either from real (EMA) or synthesized (TaDA) articulatory data. In this example, we assume that there are M = 7 articulator fleshpoint trajectories. We would like to find K = 5 basis functions or articulatory primitives, collectively depicted as the big red cuboid (representing a three-dimensional matrix W). Each vertical slab of the cuboid is one primitive (numbered 1 to 5). For instance, the white tube represents a single component of the third primitive that corresponds to the first articulator (T samples long). The activation of each of these five time-varying primitives/basis functions is given by the rows of the activation matrix H in the bottom right hand corner. For instance, the five values in the tth column of H are the weights which multiply each of the five primitives at the tth time sample.

(Color online) Schematic illustrating how shifted and scaled primitives can additively reconstruct the original input data sequence. Each gold square in the topmost row represents one column vector of the input data matrix, V, corresponding to a single sampling instant in time. Recall that our basis functions/primitives are time-varying. Hence, at any given time instant t, we plot only the basis functions/primitives that have non-zero activation (i.e., the corresponding rows of the activation matrix at time t has non-zero entries). Notice that any given basis function extends T = 4 samples long in time, represented by a sequence of four silver/gray squares each. Thus, in order to reconstruct say the fourth column of V, we need to consider the contributions of all basis functions that are “active” starting anywhere between time instant 1 to 4, as shown.

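
The shift-and-add bookkeeping in this schematic can be stated compactly: column n of V collects contributions from every basis activated within the preceding T samples. Below is a hypothetical per-column sketch, assuming W is an M × K × T array of bases and H a K × N activation matrix (names and shapes chosen for illustration, not taken from the paper's code).

```python
import numpy as np

def reconstruct_column(W, H, n):
    """Column n of the reconstruction:
    v_n = sum_{t=0}^{T-1} W[:, :, t] @ H[:, n - t].

    A basis activated at sample n - t is t samples into its
    T-sample span when it contributes to column n.
    """
    M, K, T = W.shape
    v = np.zeros(M)
    for t in range(T):
        if n - t >= 0:  # bases cannot start before the first sample
            v += W[:, :, t] @ H[:, n - t]
    return v
```

With T = 4, as in the figure, column 4 of V sums contributions from bases activated at samples 1 through 4, exactly as the schematic shows.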
(Color online) Akaike Information Criterion (AIC) values for different values of K (the number of bases) and T (the temporal extent of each basis) computed for speaker fsew0. We observe that an optimal model selection prefers the parameter values to be as low as possible since the number of parameters in the model far exceeds the contribution of the log likelihood term in computing the AIC.

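
For reference, the criterion in this figure is AIC = 2k − 2 ln L. The sketch below assumes i.i.d. Gaussian residuals, the usual choice when AIC is computed from a reconstruction error; how the authors count parameters k is not specified here, but taking every entry of W and H (k ≈ K·M·T + K·N) is one plausible assumption.

```python
import numpy as np

def aic_gaussian(rss, n_obs, n_params):
    """AIC = 2k - 2 ln L, with the Gaussian log likelihood evaluated
    at the maximum-likelihood residual variance rss / n_obs."""
    log_lik = -0.5 * n_obs * (np.log(2 * np.pi * rss / n_obs) + 1)
    return 2 * n_params - 2 * log_lik
```

Because a parameter count on the order of K·M·T + K·N grows quickly with K and T, the 2k penalty can dominate the likelihood term, which is consistent with the caption's observation that the AIC-optimal model pushes K and T as low as possible.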
(Color online) Histogram of the number of non-zero constriction task variables at any sampling instant.

(Color online) RMSE for each articulator and phone class (categorized by ARPABET symbol) obtained as a result of running the algorithm on all 460 sentences spoken by male speaker msak0.

(Color online) RMSE for each articulator and phone class (categorized by ARPABET symbol) obtained as a result of running the algorithm on all 460 sentences spoken by female speaker fsew0.

(Color online) Histograms of the fraction of variance unexplained (FVU) by the proposed cNMFsc model for MOCHA-TIMIT speakers (a) msak0 and (b) fsew0. The samples of the distribution were obtained for each speaker by computing the FVU for each of the 460 sentences. (The algorithm parameters used in the model were Sh = 0.65, K = 8, and T = 10).

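
The FVU statistic histogrammed here is the residual sum of squares normalized by the total sum of squares about the mean. A minimal sketch follows; whether the mean is taken per articulator channel or over the whole matrix is an assumption (a global mean is used here for simplicity).

```python
import numpy as np

def fraction_variance_unexplained(V, V_hat):
    """FVU = sum((V - V_hat)^2) / sum((V - mean(V))^2).
    0 means a perfect reconstruction; 1 means the model does no
    better than predicting the grand mean of the data."""
    sse = np.sum((V - V_hat) ** 2)
    sst = np.sum((V - V.mean()) ** 2)
    return sse / sst
```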
(Color online) Original (dashed) and cNMFsc-estimated (solid) articulator trajectories of selected (left) TaDA articulator variables and (right) EMA (MOCHA-TIMIT) articulator variables (obtained from speaker msak0) for the sentence “this was easy for us.” The vertical axis in each subplot depicts the value of the articulator variable scaled by its range (to the interval [0,1]), while the horizontal axis shows the sample index in time (sampling rate = 100 Hz). The algorithm parameters used were Sh = 0.65, K = 8, and T = 10. See Table I for an explanation of articulator symbols.

(Color online) Spatio-temporal basis functions or primitives extracted from MOCHA-TIMIT data from speaker msak0. The algorithm parameters used were Sh = 0.65, K = 8 and T = 10. The front of the mouth is located toward the left hand side of each image (and the back of the mouth on the right). Each articulator trajectory is represented as a curve traced out by ten colored markers (one for each time step) starting from a lighter color and ending in a darker color. The marker used for each trajectory is shown in the legend.

(Color online) Average activation pattern of the K = 8 basis functions or primitives for (a) voiceless stop consonants, and (b) British English vowels obtained from speaker msak0's data. For each phone category, 8 colorbars are plotted, one corresponding to the average activation of each of the 8 primitives. This was obtained by collecting all columns of the activation matrix corresponding to each phone interval (as well as T – 1 columns before and after) and taking the average across each of the K = 8 rows.

## Tables

Articulator flesh point variables that comprise the post-processed synthetic (TaDA) and real (EMA) datasets that we use for our experiments. Note that the EMA dataset does not have a Tongue Root sensor, but has an extra maxillary (upper incisor) sensor in addition to the mandibular (jaw) sensor. Also, TaDA does not output explicit spatial coordinates of the velum. Instead it has a single velic opening parameter that controls the degree to which the velopharyngeal port is open. Since this parameter is not a spatial (x, y) coordinate like the other variables considered, we chose to omit this parameter from the analysis described in this paper.

Top five canonical correlation values between the gestural activation matrix G (generated by TaDA) and the estimated activation matrix H for both TaDA and EMA cases.

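
Canonical correlations of this kind can be computed by treating the rows of G and H as variables observed over N time samples, orthonormalizing each set with a QR decomposition, and reading the correlations off the singular values of the cross-product. The following is a generic CCA sketch, not the authors' exact procedure.

```python
import numpy as np

def canonical_correlations(X, Y, k=5):
    """Top-k canonical correlations between two multichannel signals.

    X : (p, N) and Y : (q, N) matrices whose rows are channels
        (e.g., a gestural activation matrix G and an estimated
        activation matrix H sampled over the same N instants).
    """
    # Center each channel over time.
    Xc = (X - X.mean(axis=1, keepdims=True)).T   # (N, p)
    Yc = (Y - Y.mean(axis=1, keepdims=True)).T   # (N, q)
    # Orthonormalize the column spaces; the singular values of
    # Qx^T Qy are the canonical correlations, all in [0, 1].
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return s[:k]
```

Identical inputs yield correlations of 1; unrelated inputs yield values near 0, so the top five values summarize how much of the gestural score's structure the estimated activations recover.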