Possible shapes for the convolution kernel, . More possible shapes can be obtained by reversing the time (delay) axis or the vertical axis.
The phase for a typical utterance plotted against time. The algorithm missed a prediction near , then behaved well for the rest of the utterance (dashed line). The solid line shows , wrapped into the range from to to simulate the behavior of Eq. (10).
Values of , averaged over the corpus. The lower left measurement is the baseline, where acoustic data are shuffled with respect to ticks. Other conditions correspond to the six cases of Sec. III B, in order: running the analysis only on , and enhanced by other acoustical properties .
(Color online) Convolution kernels, , that are optimal for bootstrap samples of the data. The maxima of the curves are aligned at . (These kernels maximize , and have , , and zero, corresponding to .)
(Color online) Optimal bootstrap samples of the convolution kernel, , for the analysis. The maxima of the curves are aligned at . (These kernels maximize , and allow , , and to be nonzero.)
Theoretical loudness contours based on this work for prominences that are (top) on adjacent syllables, (middle) separated by one syllable, and (bottom) separated by two syllables. The dashed line is a loudness reference.
Phase histograms (, relative to the phase of ) for three different subjects. The subjects shown have (reading from top to bottom, in the center) the largest, median, and smallest value of . The subfigure shows values of for each subject, with error bars on the average. (The horizontal axis in the subfigure has no meaning; it just separates subjects.)
Phase histogram for a typical speaker (outline) and the histogram of relative to the average phase of each utterance (filled). The peak of the dashed histogram shows the typical phase relationship between metronome ticks and the algorithm’s predictions for that subject. The widths of the histograms show timing inconsistencies between the subject’s speech and the metronome.
The phase relationship between the metronome ticks and peaks of (loudness, loosely speaking). Each utterance (i.e., ten repetitions of one text by one speaker) is represented by a dot at the (complex) value of , with the real part on the horizontal axis and the imaginary part on the vertical axis. The distance from the origin is thus proportional to for that utterance. Dots near the origin represent utterances that did not have a consistent phase relationship between loudness and the metronome; dots on the unit circle would have a perfectly consistent phase relationship. The angle of the point, when viewed from the origin, matches the phase of , and dots just to the right of the origin come from utterances where the peak in is aligned with the metronome ticks. Dashed circles represent the average of each subject’s utterances.
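The geometric reading in the caption above can be illustrated with a minimal sketch. It assumes that each dot is the average of unit phasors, one per metronome tick, which is a common way to summarize phase consistency; the paper's own defining equation is not reproduced here, so this is an assumption rather than the authors' method. The function name `mean_phase_vector` and the sample phases are hypothetical.

```python
import cmath
import math


def mean_phase_vector(phases):
    """Average the unit phasors exp(i*phase) over a list of phases (radians).

    Assumption: one phase per metronome tick. The magnitude of the result
    lies between 0 (no consistent phase) and 1 (perfectly consistent phase),
    and its angle is the typical phase.
    """
    if not phases:
        raise ValueError("need at least one phase")
    return sum(cmath.exp(1j * p) for p in phases) / len(phases)


# Hypothetical utterance whose ten ticks all share the same phase:
# the dot lands on the unit circle, at an angle equal to that phase.
consistent = mean_phase_vector([0.3] * 10)

# Hypothetical utterance whose ten phases are spread evenly around the circle:
# the phasors cancel and the dot lands at the origin.
scattered = mean_phase_vector([2 * math.pi * k / 10 for k in range(10)])

print(abs(consistent), cmath.phase(consistent))
print(abs(scattered))
```

Under this assumption, a dot just to the right of the origin corresponds to a small positive real value with a near-zero angle, that is, a weak but tick-aligned phase relationship.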
Sample audio data along with the spectral slope, . This shows one repetition of “We always do.”
Parameters that yield the largest . The analysis operates on only . The right column shows the distribution of values that were tested in the optimization procedure (90 000 samples), and the center column shows the distribution of optimal values that were found (3400 bootstrap corpora).