Text-to-Speech (TTS) systems support emotional control via natural-language instructions, but expressiveness, naturalness, and speech quality degrade when the target emotion conflicts with the semantics of the text. We propose Cross-modal Consistency Guided Classifier-Free Guidance (CCG-CFG), which sets the guidance scale dynamically according to the degree of inconsistency between the text emotion and the explicitly specified speech emotion, and which replaces the dropout (null) condition of classifier-free guidance with the text emotion. We also distill the CCG-CFG guidance signal using a hard-sample mining strategy, improving the TTS model's emotional alignment capability. Evaluations on five emotional corpora and two TTS benchmarks show that our approaches, applied to CosyVoice2, achieve up to a 12% absolute improvement in emotion-recognition accuracy and a 10% relative improvement in subjective scores, outperforming baselines including HierSpeech++, Qwen3-TTS, and the original CosyVoice2, while preserving intelligibility, naturalness, and high speech quality.
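The guidance rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how inconsistency is measured or how it maps to a scale, so the total-variation score, the linear mapping, the bounds `w_min` and `w_max`, and all function names below are assumptions chosen for illustration. The only parts taken from the abstract are that the scale depends on text/speech emotion inconsistency and that the text-emotion branch stands in for the usual dropped-out condition.

```python
import numpy as np


def inconsistency(text_emotion_probs, target_emotion_probs):
    """Hypothetical inconsistency score in [0, 1].

    Assumes both inputs are probability distributions over the same
    emotion classes and uses total variation distance between them.
    """
    p = np.asarray(text_emotion_probs, dtype=float)
    q = np.asarray(target_emotion_probs, dtype=float)
    return 0.5 * float(np.abs(p - q).sum())


def dynamic_scale(score, w_min=1.0, w_max=3.0):
    """Assumed linear map from inconsistency score to guidance scale."""
    return w_min + (w_max - w_min) * score


def guided_prediction(pred_text_emotion, pred_target_emotion, scale):
    """Classifier-free-guidance style combination.

    Standard CFG extrapolates from an unconditional prediction; here the
    prediction conditioned on the text emotion takes that role.
    """
    base = np.asarray(pred_text_emotion, dtype=float)
    cond = np.asarray(pred_target_emotion, dtype=float)
    return base + scale * (cond - base)


if __name__ == "__main__":
    # Toy values: text reads as "happy", instruction asks for "sad".
    text_emotion = [0.9, 0.1]
    target_emotion = [0.0, 1.0]
    w = dynamic_scale(inconsistency(text_emotion, target_emotion))
    out = guided_prediction(np.zeros(4), np.ones(4), w)
    print(w, out)
```

With this form, a scale of 1 recovers the target-emotion prediction unchanged, and larger inconsistency pushes the output further away from the text-emotion branch.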