Text-to-Speech (TTS) systems can achieve fine-grained control over emotional expression via natural language prompts, but a significant challenge arises when the desired emotion (the style prompt) conflicts with the semantic content of the text. This mismatch often results in unnatural-sounding speech, which undermines that control. Classifier-Free Guidance (CFG) is a key technique for enhancing prompt alignment; however, its application to auto-regressive (AR) TTS models remains underexplored, and applying it improperly can degrade audio quality. This paper addresses style-content mismatch in AR TTS models by proposing an adaptive CFG scheme that adjusts to the level of detected mismatch, as measured using large language models or natural language inference models. The scheme is based on a comprehensive analysis of CFG's impact on emotional expressiveness in state-of-the-art AR TTS models. Our results demonstrate that the proposed adaptive CFG scheme improves the emotional expressiveness of the AR TTS model while maintaining audio quality and intelligibility.
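To make the idea of mismatch-dependent guidance concrete, the following is a minimal sketch and not the method of the paper. It assumes the mismatch detector (an LLM or NLI model) has already produced a score in [0, 1], and it applies the standard CFG combination to the next-token logits of an AR model. The function names, the linear schedule, and the endpoint scales are all hypothetical; the abstract does not specify how the guidance scale depends on the mismatch level, so the schedule takes both endpoints as explicit arguments.

```python
import math
from typing import List


def adaptive_cfg_scale(mismatch: float,
                       scale_at_match: float,
                       scale_at_mismatch: float) -> float:
    """Interpolate a guidance scale from a mismatch score in [0, 1].

    Assumption: the linear schedule and both endpoint scales are
    illustrative placeholders, not values taken from the paper.
    """
    m = min(max(mismatch, 0.0), 1.0)  # clamp out-of-range scores
    return scale_at_match + m * (scale_at_mismatch - scale_at_match)


def cfg_logits(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Standard CFG combination: uncond + scale * (cond - uncond).

    scale == 1.0 returns the conditional logits unchanged.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Hypothetical next-token logits with and without the style prompt.
    cond_logits = [2.0, 0.5, -1.0]
    uncond_logits = [1.0, 1.0, 0.0]
    # Hypothetical mismatch score, e.g. from an NLI contradiction probability.
    mismatch_score = 0.8
    # Placeholder endpoint scales; choose them for the model at hand.
    scale = adaptive_cfg_scale(mismatch_score, scale_at_match=1.0, scale_at_mismatch=3.0)
    print(softmax(cfg_logits(cond_logits, uncond_logits, scale)))
```

In an actual AR TTS decoder, the combination would be applied at every decoding step to the logits from a conditional and an unconditional forward pass. Whether the scale should rise or fall with the mismatch score is a design decision that this sketch deliberately leaves to the caller.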