This paper investigates word surprisal, a measure of how predictable a word is in a given context, as a feature for improving prosody in speech synthesis. We explore how word surprisal extracted from large language models (LLMs) correlates with word prominence, a signal-based measure of the salience of a word in a given discourse. We also examine how context length and LLM size affect the results, and how a speech synthesizer conditioned on surprisal values compares with a baseline system. To evaluate these factors, we conducted experiments using a large corpus of English text and LLMs of varying sizes. Our results show that word surprisal and word prominence are moderately correlated, suggesting that they capture related but distinct aspects of language use. Context length and LLM size both affect the correlations, but not in the anticipated direction: longer contexts and larger LLMs generally underpredict prominent words in a nearly linear manner. In line with these findings, a speech synthesizer conditioned on surprisal values yields only a minimal improvement over the baseline, suggesting that surprisal values have a limited effect in eliciting appropriate prominence patterns.
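As an illustration of the quantity discussed above, word surprisal is commonly defined as the negative log-probability of a word given its preceding context. The sketch below is a minimal example under stated assumptions, not the extraction procedure used in this work: the token probabilities are made-up values standing in for LLM output, the base-2 logarithm is one possible choice, and summing subword-token surprisals to obtain a word-level value is a common convention rather than something the abstract specifies. The names `word_surprisal`, `token_probs`, and `word_ids` are hypothetical.

```python
import math


def word_surprisal(token_probs, word_ids):
    """Aggregate per-token surprisal (-log2 p) into per-word surprisal.

    token_probs: conditional probability of each token given its left
                 context (hypothetical values; in practice from an LLM).
    word_ids:    index of the word each token belongs to, starting at 0.
    """
    n_words = max(word_ids) + 1
    surprisal = [0.0] * n_words
    for p, w in zip(token_probs, word_ids):
        # Assumption: a word split into several subword tokens gets the
        # sum of its tokens' surprisals (chain rule over the tokens).
        surprisal[w] += -math.log2(p)
    return surprisal


# Toy input: three words, the second one split into two subword tokens.
token_probs = [0.5, 0.25, 0.5, 0.125]
word_ids = [0, 1, 1, 2]
print(word_surprisal(token_probs, word_ids))
```

Under this definition a less predictable word receives a higher value, which is the quantity that would then be compared against a per-word prominence score.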