We investigate whether task-vector arithmetic, which has succeeded at cross-speaker emotional intensity control in modular text-to-speech (TTS), transfers to large-scale TTS systems built on language-model backbones with in-context learning (LM-TTS). Through a systematic elimination study on Qwen3-TTS-12Hz-1.7B over four progressively narrower operands (model weights via LoRA fine-tuning, continuous codec embeddings, discrete codec tokens, and the speaker embedding, or x-vector, produced by an ECAPA-TDNN encoder jointly trained with the synthesis backbone), we localize the dominant carrier of emotional prosody to the x-vector. Building on this finding, we propose a training-free method based on centroid arithmetic in x-vector space: an emotion direction $\tau = \mathbb{E}_i[x(s_i,\text{emo})] - \mathbb{E}_i[x(s_i,\text{neutral})]$, averaged over source speakers $s_i$, is applied to an unseen target speaker as $x_{\text{new}} = x(\text{target},\text{neutral}) + \alpha \cdot \tau$ with scalar weight $\alpha$. Using ESD (English) as the source of $\tau$ and emoUERJ (Brazilian Portuguese) as a cross-lingual ground-truth target, we observe average gains in emotion2vec cosine similarity over the ICL baseline of $+0.29$ on held-out English speakers and $+0.09$ on held-out Brazilian Portuguese speakers, while largely preserving speaker identity (WavLM SECS $\gtrsim 0.88$ for the multi-speaker $\tau$ variant) and intelligibility (WER $\approx 0$ in PT-BR). These results offer initial evidence that the reported incompatibility of centroid-arithmetic style control with token-based TTS architectures may be circumvented when the arithmetic operates on the speaker embedding.
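The centroid arithmetic described above can be illustrated with a minimal sketch in plain Python. It assumes x-vectors are already available as lists of floats; the function names (`centroid`, `emotion_direction`, `apply_direction`) and the three-dimensional toy vectors are hypothetical and chosen only for illustration. They are not taken from the Qwen3-TTS code base, and extracting real x-vectors from the ECAPA-TDNN encoder is outside the scope of this example.

```python
from typing import List, Sequence

Vector = List[float]


def centroid(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty collection of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def emotion_direction(emo: Sequence[Sequence[float]],
                      neutral: Sequence[Sequence[float]]) -> Vector:
    """tau = E_i[x(s_i, emo)] - E_i[x(s_i, neutral)]."""
    c_emo, c_neu = centroid(emo), centroid(neutral)
    if len(c_emo) != len(c_neu):
        raise ValueError("emotional and neutral vectors differ in dimension")
    return [e - n for e, n in zip(c_emo, c_neu)]


def apply_direction(target_neutral: Sequence[float],
                    tau: Sequence[float],
                    alpha: float = 1.0) -> Vector:
    """x_new = x(target, neutral) + alpha * tau."""
    if len(target_neutral) != len(tau):
        raise ValueError("target and tau differ in dimension")
    return [x + alpha * t for x, t in zip(target_neutral, tau)]


# Toy stand-ins for x-vectors of two source speakers (real ones are much larger).
neutral_xvecs = [[0.0, 1.0, 2.0], [2.0, 3.0, 4.0]]
emotional_xvecs = [[1.0, 1.0, 1.0], [3.0, 3.0, 5.0]]

tau = emotion_direction(emotional_xvecs, neutral_xvecs)

# Neutral x-vector of a speaker that did not contribute to tau.
target_xvec = [0.5, -0.5, 0.25]
x_new = apply_direction(target_xvec, tau, alpha=2.0)
```

In this sketch `tau` is computed once from the source speakers and reused for any target, which is what makes the approach training-free; `x_new` would then replace the target's neutral x-vector as the speaker conditioning at synthesis time.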