This study proposes a segmental-level prosodic probing framework to evaluate neural TTS models' ability to reproduce consonant-induced f0 perturbation, a fine-grained segmental-prosodic effect that reflects local articulatory mechanisms. We compare synthetic and natural speech realizations for thousands of words, stratified by lexical frequency, using Tacotron 2 and FastSpeech 2 trained on the same speech corpus (LJ Speech). These controlled analyses are then complemented by a large-scale evaluation spanning multiple advanced TTS systems. Results show accurate reproduction for high-frequency words but poor generalization to low-frequency items, suggesting that the examined TTS architectures rely more on lexical-level memorization than on abstract segmental-prosodic encoding. This finding highlights a limitation in such TTS systems' ability to generalize prosodic detail beyond seen data. The proposed probe offers a linguistically informed diagnostic framework that may inform future TTS evaluation methods, and has implications for interpretability and authenticity assessment in synthetic speech.
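As a rough illustration of the kind of comparison the probe involves, the sketch below computes a consonant-induced f0 effect (mean vowel-onset f0 after voiceless onsets minus mean after voiced onsets) separately for natural and synthetic speech within each lexical-frequency band. This is a minimal sketch under stated assumptions, not the study's actual pipeline: the record fields (`source`, `band`, `voicing`, `onset_f0_st`), the function name `cf0_effect`, the use of semitone-normalized onset f0, and the toy values are all hypothetical placeholders introduced here for illustration.

```python
from collections import defaultdict
from statistics import mean


def cf0_effect(records):
    """Return {(source, band): mean onset f0 after voiceless - after voiced}.

    Each record is a dict with hypothetical fields:
      source       "natural" or "synthetic"
      band         lexical-frequency stratum, e.g. "high" or "low"
      voicing      voicing of the onset consonant: "voiceless" or "voiced"
      onset_f0_st  f0 at vowel onset in semitones (assumed normalization)
    """
    groups = defaultdict(list)
    for r in records:
        groups[(r["source"], r["band"], r["voicing"])].append(r["onset_f0_st"])

    effects = {}
    for source, band in {(s, b) for s, b, _ in groups}:
        voiceless = groups.get((source, band, "voiceless"))
        voiced = groups.get((source, band, "voiced"))
        # Only report a stratum when both voicing categories are present.
        if voiceless and voiced:
            effects[(source, band)] = mean(voiceless) - mean(voiced)
    return effects


# Toy placeholder values, NOT measurements from the study.
toy = [
    {"source": "natural", "band": "high", "voicing": "voiceless", "onset_f0_st": 1.0},
    {"source": "natural", "band": "high", "voicing": "voiced", "onset_f0_st": -0.5},
    {"source": "synthetic", "band": "high", "voicing": "voiceless", "onset_f0_st": 0.75},
    {"source": "synthetic", "band": "high", "voicing": "voiced", "onset_f0_st": -0.25},
    {"source": "synthetic", "band": "low", "voicing": "voiceless", "onset_f0_st": 0.25},
    {"source": "synthetic", "band": "low", "voicing": "voiced", "onset_f0_st": 0.25},
]

effects = cf0_effect(toy)
for key in sorted(effects):
    print(key, round(effects[key], 3))
```

Under this framing, a synthetic effect that matches the natural one in the high-frequency band but shrinks in the low-frequency band would be the pattern the abstract describes as lexical-level memorization rather than abstract segmental-prosodic encoding.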