Here we illustrate three current LISTA research lines.
1. Human speech modifications. One goal is to better understand what speech modifications talkers employ in adverse conditions and to explore how these changes can be used to enhance natural and synthetic speech. Figure 1 shows a fragment of the recording and transcription of four simultaneous talkers, used to measure the effect of a background conversation on a foreground dialogue. Findings to date indicate that in the presence of competing talkers, speakers are less precise in the timing of their turns, interrupt more, and produce more dysfluencies, false starts and hesitations. In response to this adverse condition, speakers make significant reductions in their speech rate to aid listeners in message comprehension.
Figure 1: Multi-tier, multi-channel annotation of a dual-dialogue corpus. Approximately 15,000 individual labels are provided manually for each hour of speech.
2. Optimising objective speech intelligibility in noise. Another research direction involves the use of objective measures to optimise speech intelligibility by reallocating speech energy in time and frequency, without changing the overall speech level, and hence without increasing listener discomfort. Figure 2 depicts the result of various reallocation strategies designed to enhance speech in babble noise. In each case, the speech energy is the same (SNR=-5 dB), but the objective intelligibility differs, with significant improvements over the baseline no-reallocation approach. Subjective intelligibility testing will take place in February 2011.
Sound examples: M0 | M1 | M2 | M3 | M4 | M5
Figure 2: Energy reallocation in time and frequency. Regions in red in these speech spectrograms show those spectro-temporal regions whose energy is boosted while those in blue represent attenuated energy regions. M0 is the no-reallocation baseline. M1 equalises each time interval to have the same signal-to-noise ratio (SNR), while M2 equalises the SNR in each frequency region. M3 goes a step further, and ensures that each time-frequency ‘pixel’ has the same SNR. Strategies M4 and M5 optimise the range of frequencies most effective for boosting intelligibility. M5 applies the further constraint of boosting mainly those regions where the signal is on the borderline of audibility.
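The equalisation idea behind M1, M2 and M3 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the project's implementation: it assumes speech and noise are available as power grids indexed as [time][frequency], and the names `total_energy`, `region_gains` and `reallocate` are hypothetical. Each region (time frame, frequency band, or single pixel) receives a gain that equalises its local SNR, and the result is rescaled so that the overall speech energy is unchanged. The frequency-range optimisation used by M4 and M5 is not covered.

```python
def total_energy(power):
    """Sum of all cells in a [time][frequency] power grid."""
    return sum(sum(row) for row in power)


def _ratio(noise_energy, speech_energy):
    # Gain that would bring a region to 0 dB SNR; silent regions get no gain.
    return noise_energy / speech_energy if speech_energy > 0 else 0.0


def region_gains(speech, noise, mode):
    """Per-cell power gains that equalise SNR across the chosen regions."""
    n_t, n_f = len(speech), len(speech[0])
    if mode == "time":  # M1-like: one gain per time frame
        per_frame = [_ratio(sum(noise[t]), sum(speech[t])) for t in range(n_t)]
        return [[per_frame[t]] * n_f for t in range(n_t)]
    if mode == "frequency":  # M2-like: one gain per frequency band
        per_band = [
            _ratio(sum(noise[t][f] for t in range(n_t)),
                   sum(speech[t][f] for t in range(n_t)))
            for f in range(n_f)
        ]
        return [list(per_band) for _ in range(n_t)]
    if mode == "pixel":  # M3-like: one gain per time-frequency cell
        return [[_ratio(noise[t][f], speech[t][f]) for f in range(n_f)]
                for t in range(n_t)]
    raise ValueError(f"unknown mode: {mode}")


def reallocate(speech, noise, mode):
    """Equalise local SNR, then rescale so total speech energy is preserved."""
    gains = region_gains(speech, noise, mode)
    shaped = [[g * s for g, s in zip(g_row, s_row)]
              for g_row, s_row in zip(gains, speech)]
    shaped_energy = total_energy(shaped)
    if shaped_energy == 0:
        # Nothing to redistribute; return an unmodified copy.
        return [list(row) for row in speech]
    scale = total_energy(speech) / shaped_energy
    return [[scale * v for v in row] for row in shaped]


# Toy 2x2 example (made-up values): two frames, two frequency bands.
speech = [[4.0, 1.0], [1.0, 2.0]]
noise = [[1.0, 1.0], [2.0, 4.0]]
m1_like = reallocate(speech, noise, "time")
```

Because the final rescaling is a single global factor, it changes every region's SNR by the same amount, so the regions remain equalised while the overall speech level stays fixed.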
3. Intelligibility boosting of synthetic speech in noise. Speakers adjust their speaking style to fit the context, sometimes under-articulating and at other times using more extreme articulations which appear to be beneficial in noise. A recent development in hidden Markov model (HMM) based speech synthesis permits precise articulatory control, allowing human-like modifications of articulators to produce hypo- and hyper-articulated speech. The sound examples demonstrate synthetic speech in car noise with (i) normal articulation, (ii) hypo-articulation, (iii) hyper-articulation, and (iv) hyper-articulation plus spectral tilt modification.