Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. This task is central to speech translation and is the focus of the IWSLT 2026 Cross-Lingual Voice Cloning track. A key challenge is maintaining intelligibility and naturalness in the presence of accent variation and domain-specific vocabulary. We build on a multilingual text-to-speech model, FishAudio-S2-Pro, and introduce language tag prompting to improve language control and reduce accent leakage. We further apply reinforcement learning (RL) fine-tuning for task adaptation and observe improvements in intelligibility. Finally, we propose a reference-conditioned lexical matching method that improves pronunciation of domain-specific terms when lexical overlap is present. Results show that language tag prompting provides the largest gains, while lexical matching yields consistent improvements on matched subsets.