Direct Speech-to-Speech Translation (S2ST) has gained increasing attention for its ability to translate speech from one language to another while reducing the error propagation and latency inherent in traditional cascaded pipelines. However, existing direct S2ST systems still face notable challenges, including unstable semantic-acoustic alignment when parallel speech data are scarce, difficulty in preserving speaker identity, and limited multilingual scalability. In this work, we introduce DS2ST-LM, a scalable, single-stage direct S2ST framework built on a multilingual Large Language Model (LLM). The architecture integrates a Whisper speech encoder, a learnable projection module, a Qwen2-0.5B LLM, and a timbre-controlled vocoder. We construct GigaS2S-1000, a 1000-hour bilingual corpus, by extending the GigaST dataset with high-fidelity synthetic target speech, and show that this synthetic data alleviates data scarcity to some extent. We investigate two semantic token generation strategies, speech-derived S3 tokens and text-derived tokens generated by a pre-trained LLM, and analyze their impact on training stability and semantic consistency. We further evaluate three projection architectures (Linear, Conv1D-Linear, and Q-Former) and observe that although higher-capacity projectors converge faster, the simple Linear projector ultimately achieves the best performance. Extensive experiments demonstrate that DS2ST-LM outperforms traditional cascaded and ST (Qwen-Audio) + TTS baselines on both lexical (BLEU, METEOR) and semantic (BLEURT, COMET) metrics, while extending to multiple language pairs, including French, Spanish, German, Hindi, Bengali, and Urdu. Furthermore, we incorporate timbre-aware speech synthesis to preserve speaker information, enabling DS2ST-LM to surpass prior direct S2ST systems in both speaker similarity and perceptual naturalness.
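To make the projection-module comparison concrete, the following is a minimal sketch of the simplest variant, a single Linear projector that maps encoder frames into the LLM embedding space. The dimensions are illustrative assumptions, not values from the paper: Whisper-large encoder frames are 1280-dimensional, and Qwen2-0.5B uses a 896-dimensional hidden size.

```python
import numpy as np

# Assumed dimensions (illustrative, not from the paper):
# d_enc  - Whisper encoder output dimension per frame
# d_llm  - LLM (Qwen2-0.5B) embedding dimension
d_enc, d_llm = 1280, 896

rng = np.random.default_rng(0)
W = rng.standard_normal((d_enc, d_llm)) * 0.02  # learnable projection weight
b = np.zeros(d_llm)                             # learnable projection bias

def linear_projector(speech_feats: np.ndarray) -> np.ndarray:
    """Map a (T, d_enc) sequence of encoder frames to (T, d_llm)
    so the frames can be consumed as LLM input embeddings."""
    return speech_feats @ W + b

# Example: project 50 encoder frames into the LLM embedding space.
frames = rng.standard_normal((50, d_enc))
projected = linear_projector(frames)
print(projected.shape)  # (50, 896)
```

The Conv1D-Linear and Q-Former variants add temporal downsampling and learned-query cross-attention, respectively; the finding that the plain Linear mapping performs best suggests that, with sufficient training data, the LLM itself can absorb most of the alignment burden.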