We analyze the syntactic sensitivity of Text-to-Speech (TTS) systems using methods inspired by psycholinguistic research. Specifically, we focus on the generation of intonational phrase boundaries, which can often be predicted by identifying syntactic boundaries within a sentence. We find that TTS systems struggle to accurately generate intonational phrase boundaries in sentences where syntactic boundaries are ambiguous (e.g., garden path sentences or sentences with attachment ambiguity). In these cases, systems need superficial cues such as commas to place boundaries at the correct positions. In contrast, for sentences with simpler syntactic structures, we find that systems do incorporate syntactic cues beyond surface markers. Finally, we finetune models on sentences without commas at the syntactic boundary positions, encouraging them to focus on more subtle linguistic cues. Our findings indicate that this leads to more distinct intonation patterns that better reflect the underlying structure.