Cross-modal Consistency Guidance for Robust Emotion Control in Auto-Regressive TTS Models

While Text-to-Speech (TTS) systems enable emotional control via natural-language instructions, expressiveness, naturalness, and speech quality degrade when the target emotion conflicts with the textual semantics. We propose a Cross-modal Consistency Guided Classifier-Free Guidance (CCG-CFG) method with dynamic scales based on the degree of inconsistency between the text emotion and the explicit speech emotion, replacing the dropout condition with the text emotion. We also distill the CCG-CFG guidance signal using a hard-sample mining strategy, improving the TTS model's emotional alignment capability. Evaluations on five emotional corpora and two TTS benchmarks show that our approaches applied to CosyVoice2 achieve up to a 12% absolute improvement in emotion-recognition accuracy and a 10% relative improvement in subjective scores, outperforming baselines including HierSpeech++, Qwen3-TTS, and original CosyVoice2, while preserving intelligibility, naturalness, and high speech quality.
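The dynamic-scale idea described above can be illustrated with a minimal sketch. This is not the authors' formulation: the abstract gives no equations, so the inconsistency measure (total variation distance between emotion distributions), the linear scale schedule, and the names `total_variation`, `dynamic_scale`, `guided_logits`, `base_scale`, and `max_extra` are all assumptions made for illustration. The sketch uses the usual classifier-free guidance extrapolation, but takes logits conditioned on the text emotion as the reference branch instead of unconditional logits.

```python
# Illustrative sketch only; all names and formulas are assumptions,
# not taken from the paper or from CosyVoice2.

def total_variation(p, q):
    """Total variation distance between two discrete distributions, in [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def dynamic_scale(text_emotion, target_emotion, base_scale=1.0, max_extra=2.0):
    """Assumed schedule: grow the guidance scale linearly with the
    inconsistency between the text emotion and the target speech emotion."""
    return base_scale + max_extra * total_variation(text_emotion, target_emotion)


def guided_logits(cond_logits, ref_logits, scale):
    """CFG-style extrapolation from the reference branch (here: logits
    conditioned on the text emotion) toward the target-emotion branch."""
    return [r + scale * (c - r) for c, r in zip(cond_logits, ref_logits)]


# Usage: the text reads as "happy" while the instruction asks for "sad",
# so the two emotion distributions are fully inconsistent.
text_emotion = [1.0, 0.0]    # hypothetical [happy, sad] probabilities from the text
target_emotion = [0.0, 1.0]  # hypothetical probabilities for the instructed emotion
scale = dynamic_scale(text_emotion, target_emotion)
next_token_logits = guided_logits([2.0, 0.0], [1.0, 1.0], scale)
```

When the two emotion distributions agree, the assumed schedule falls back to `base_scale`, and with a scale of 1 the guided logits reduce to the target-conditioned logits, so guidance only strengthens as the conflict grows.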

