Many spoken languages, including English, exhibit wide variation in dialects and accents, making accent control an important capability for flexible text-to-speech (TTS) models. Current TTS systems typically generate accented speech by conditioning on speaker embeddings associated with specific accents. While effective, this approach offers limited interpretability and controllability, as embeddings also encode traits such as timbre and emotion. In this study, we analyze the interaction between speaker embeddings and linguistically motivated phonological rules in accented speech synthesis. Using American and British English as a case study, we implement rules for flapping, rhoticity, and vowel correspondences. We propose the phoneme shift rate (PSR), a novel metric quantifying how strongly embeddings preserve or override rule-based transformations. Experiments show that combining rules with embeddings yields more authentic accents, while embeddings can attenuate or overwrite rules, revealing entanglement between accent and speaker identity. Our findings highlight rules as a lever for accent control and a framework for evaluating disentanglement in speech generation.
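To make the rule-based side of this setup concrete, the following is a minimal sketch of how two of the rules named above (flapping and non-rhoticity) might be applied to a phoneme sequence, together with a simple proxy for how often a rule-induced shift survives in the output. Everything here is an assumption for illustration: the ARPAbet-style symbols, the function names `apply_flapping`, `apply_non_rhoticity`, and `shift_rate`, and in particular the ratio computed by `shift_rate`, which is not the definition of PSR used in this work.

```python
# Illustrative sketch only. Symbols follow an ARPAbet-like convention in which
# vowels carry a stress digit (0 = unstressed, 1 = primary, 2 = secondary).
# All names and the shift_rate formula are hypothetical, not the paper's.

VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
          "EY", "IH", "IY", "OW", "OY", "UH", "UW"}


def is_vowel(phone):
    """Return True if the phone is a vowel, ignoring its stress digit."""
    return phone.rstrip("012") in VOWELS


def apply_flapping(phones):
    """American-style flapping: T between a vowel and an unstressed vowel -> DX."""
    out = list(phones)
    for i in range(1, len(phones) - 1):
        if (phones[i] == "T"
                and is_vowel(phones[i - 1])
                and is_vowel(phones[i + 1])
                and phones[i + 1].endswith("0")):
            out[i] = "DX"
    return out


def apply_non_rhoticity(phones):
    """British-style non-rhoticity: drop R unless a vowel follows it."""
    out = []
    for i, phone in enumerate(phones):
        nxt = phones[i + 1] if i + 1 < len(phones) else None
        if phone == "R" and (nxt is None or not is_vowel(nxt)):
            continue  # R in coda position is deleted
        out.append(phone)
    return out


def shift_rate(source, target, observed):
    """Hypothetical proxy: among positions where the rule changed the phone
    (source != target), the fraction where the observed phone equals the
    rule's target. Assumes the three sequences are already aligned and of
    equal length, so it covers substitution rules only."""
    if not (len(source) == len(target) == len(observed)):
        raise ValueError("sequences must be aligned to equal length")
    changed = [i for i in range(len(source)) if source[i] != target[i]]
    if not changed:
        return 0.0
    kept = sum(1 for i in changed if observed[i] == target[i])
    return kept / len(changed)


if __name__ == "__main__":
    source = ["B", "AH1", "T", "ER0", "W", "AO1", "T", "ER0"]
    target = apply_flapping(source)
    # Pretend a phoneme recognizer heard a flap in the first word only.
    observed = ["B", "AH1", "DX", "ER0", "W", "AO1", "T", "ER0"]
    print(target)
    print(shift_rate(source, target, observed))
```

In this toy setting, a value near 1 would suggest that the rule-based shifts were preserved in the synthesized speech, and a value near 0 that they were overridden. A real evaluation would additionally need a phoneme recognizer and a sequence alignment step to handle deletions such as dropped /r/, both of which are omitted here.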