Task-Vector Arithmetic for Emotional Expressivity Control in Language-Model-Based Text-to-Speech

We investigate whether task-vector arithmetic, which has been successful for cross-speaker emotional intensity control in modular text-to-speech (TTS), transfers to large-scale TTS systems built on language-model backbones with in-context learning (LM-TTS). Through a systematic elimination study on Qwen3-TTS-12Hz-1.7B over four progressively narrower operands (model weights via LoRA fine-tuning, continuous codec embeddings, discrete codec tokens, and the speaker embedding, or x-vector, produced by an ECAPA-TDNN encoder jointly trained with the synthesis backbone), we localize the dominant carrier of emotional prosody to the x-vector. Building on this finding, we propose a training-free method based on centroid arithmetic in x-vector space: an emotion direction $\tau = \mathbb{E}_i[x(s_i,\text{emo})] - \mathbb{E}_i[x(s_i,\text{neutral})]$, averaged over source speakers $s_i$, is applied to an unseen target speaker as $x_{\text{new}} = x(\text{target},\text{neutral}) + \alpha\cdot\tau$, where $\alpha$ scales the shift. Using ESD (English) as the source of $\tau$ and emoUERJ (Brazilian Portuguese) as a cross-lingual ground-truth target, we observe average gains in emotion2vec cosine similarity over the ICL baseline of $+0.29$ on held-out English speakers and $+0.09$ on held-out Brazilian Portuguese speakers, while largely preserving speaker identity (WavLM SECS $\gtrsim 0.88$ for the multi-speaker $\tau$ variant) and intelligibility (WER $\approx 0$ in PT-BR). These results offer initial evidence that the reported incompatibility of centroid-arithmetic style control with token-based TTS architectures may be circumvented when the arithmetic operates on the speaker embedding.
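The centroid arithmetic described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`emotion_direction`, `shift_xvector`), the embedding dimension, and the synthetic random vectors standing in for real ECAPA-TDNN x-vectors are all assumptions made for illustration. It shows only the two formulas, $\tau$ as a difference of centroids and $x_{\text{new}}$ as a scaled shift of the target's neutral embedding.

```python
import numpy as np


def emotion_direction(emo_xvecs, neutral_xvecs):
    """tau = E_i[x(s_i, emo)] - E_i[x(s_i, neutral)].

    Both inputs have shape (n_speakers, dim): one x-vector per source speaker.
    """
    emo = np.asarray(emo_xvecs, dtype=np.float64)
    neutral = np.asarray(neutral_xvecs, dtype=np.float64)
    return emo.mean(axis=0) - neutral.mean(axis=0)


def shift_xvector(target_neutral_xvec, tau, alpha=1.0):
    """x_new = x(target, neutral) + alpha * tau."""
    return np.asarray(target_neutral_xvec, dtype=np.float64) + alpha * tau


# Synthetic stand-ins for x-vectors (assumption: real ones come from the
# speaker encoder; the dimension here is arbitrary).
rng = np.random.default_rng(0)
dim, n_speakers = 8, 5
true_shift = rng.normal(size=dim)            # hypothetical shared emotion offset
neutral = rng.normal(size=(n_speakers, dim))
emo = neutral + true_shift                   # every source speaker shifted alike

tau = emotion_direction(emo, neutral)
target = rng.normal(size=dim)                # unseen target speaker, neutral
x_new = shift_xvector(target, tau, alpha=0.5)
```

In this toy setup every source speaker carries the same offset, so the centroid difference recovers it exactly. With real embeddings, $\tau$ would be only an average direction, and $\alpha$ would trade emotional intensity against speaker similarity.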
