Covertly improving intelligibility with data-driven adaptations of speech timing

Human talkers often address listeners with language-comprehension challenges, such as hard-of-hearing or non-native adults, by globally slowing down their speech. However, it remains unclear whether this strategy actually makes speech more intelligible. Here, we take advantage of recent advances in machine-generated speech, which allow more precise control of speech rate, to systematically examine how targeted speech-rate adjustments may improve comprehension. We first use reverse-correlation experiments to show that the temporal influence of speech rate prior to a target vowel contrast (e.g., the tense-lax distinction) in fact manifests in a scissor-like pattern, with opposite effects in early versus late context windows; this pattern is remarkably stable both within individuals and across L1-English listeners and L2-English listeners with French, Mandarin, and Japanese L1s. Second, we show not only that this speech-rate structure facilitates L2 listeners' comprehension of the target vowel contrast, but also that native listeners rely on this pattern in challenging acoustic conditions. Finally, we build a data-driven text-to-speech algorithm that replicates this temporal structure on novel speech sequences. Across a variety of sentences and vowel contrasts, listeners remained unaware that such targeted slowing improved word comprehension. Strikingly, participants instead judged the common strategy of global slowing as clearer, even though it actually increased comprehension errors. Together, these results show that targeted adjustments to speech rate significantly aid intelligibility under challenging conditions, while often going unnoticed. More generally, this paper provides a data-driven methodology for improving the accessibility of machine-generated speech, one that can be extended to other aspects of speech comprehension and to a wide variety of listeners and environments.
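
As an illustration of the reverse-correlation logic mentioned above, the following is a minimal sketch and not the authors' analysis pipeline. It assumes that each trial applies a random speech-rate perturbation to a few context segments preceding the target vowel and that the listener gives a binary response (for instance, "tense" versus "lax"). The segment count, the simulated listener, and the shape and sign of `true_kernel` are hypothetical choices made only to show how a scissor-like kernel would be recovered from such data.

```python
import numpy as np


def reverse_correlation_kernel(profiles, responses):
    """Mean rate profile on trials answered True minus mean on trials answered False."""
    profiles = np.asarray(profiles, dtype=float)
    responses = np.asarray(responses, dtype=bool)
    return profiles[responses].mean(axis=0) - profiles[~responses].mean(axis=0)


def simulate_listener(profiles, true_kernel, noise_sd, rng):
    """Hypothetical listener: a noisy linear readout of the rate profile."""
    noise = rng.normal(0.0, noise_sd, size=len(profiles))
    return profiles @ true_kernel + noise > 0


rng = np.random.default_rng(0)
n_trials, n_segments = 5000, 6

# Random speech-rate perturbations (arbitrary units) for each context segment.
profiles = rng.normal(0.0, 1.0, size=(n_trials, n_segments))

# Assumed scissor-like sensitivity: early and late segments pull in opposite
# directions. The sign convention here is illustrative, not taken from the paper.
true_kernel = np.array([0.5, 0.5, 0.5, -1.0, -1.0, -1.0])

responses = simulate_listener(profiles, true_kernel, noise_sd=1.0, rng=rng)
kernel = reverse_correlation_kernel(profiles, responses)
print(np.round(kernel, 2))
```

In this simulation the estimated kernel changes sign between the early and late segments, mirroring the assumed sensitivity. With real listeners, the same difference-of-means estimate would be computed from recorded responses instead of simulated ones.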
