The Lombard effect plays a key role in natural communication, particularly in noisy environments or when addressing hearing-impaired listeners. We present a controllable text-to-speech (TTS) system that synthesizes Lombard speech for any speaker without requiring explicit Lombard data during training. Our approach uses style embeddings learned from a large, prosodically diverse dataset and applies principal component analysis (PCA) to identify the components that correlate with Lombard attributes. By shifting these components, we manipulate the style embeddings and feed them to our TTS model to generate speech at a desired Lombard level. Evaluations show that our method preserves naturalness and speaker identity, improves intelligibility in noise, and provides fine-grained control over prosody.
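The PCA-based manipulation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`fit_pca`, `most_correlated_component`, `shift_embedding`), the use of Pearson correlation to pick a component, and the per-utterance `attribute` vector (a stand-in for some measurable Lombard-related property) are all assumptions made for illustration. The abstract does not specify the embedding dimensionality, which components are shifted, or by how much.

```python
import numpy as np


def fit_pca(embeddings):
    """Fit PCA on an (n_utterances, dim) matrix of style embeddings.

    Returns the mean embedding and a matrix whose rows are the
    orthonormal principal directions, ordered by explained variance.
    """
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt


def most_correlated_component(embeddings, mean, components, attribute):
    """Return the index of the component whose scores have the largest
    absolute Pearson correlation with a per-utterance attribute.

    `attribute` is a hypothetical 1-D array (one value per utterance);
    the choice of attribute is an assumption, not taken from the paper.
    """
    scores = (embeddings - mean) @ components.T
    corrs = [
        abs(np.corrcoef(scores[:, k], attribute)[0, 1])
        for k in range(scores.shape[1])
    ]
    return int(np.argmax(corrs))


def shift_embedding(embedding, components, index, delta):
    """Move one embedding by `delta` along principal direction `index`.

    Because the directions are orthonormal, this changes only the
    selected PCA score and leaves all other scores untouched.
    """
    return embedding + delta * components[index]
```

In such a setup, the shifted embedding would be passed to the TTS model in place of the original style embedding, with `delta` acting as the control knob for the Lombard level. How `delta` maps to a perceived Lombard level would have to be calibrated empirically.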