Speech-to-speech translation (S2ST) systems have made substantial progress in semantic accuracy and speech naturalness. However, the cross-lingual transfer of lexical stress, a vital cue for emphasis and speaker intent, remains largely underexplored, and the problem is compounded by a lack of reliable automatic evaluation metrics for tonal languages such as Chinese. We investigate stress transfer in English-to-Chinese S2ST by constructing a stress-annotated Chinese dataset and building an XLS-R-based Mandarin stress detector. Combining this detector with the English EmphAssess system, we propose a novel objective metric for cross-lingual stress evaluation. We also fine-tune CosyVoice3 to build a stress-aware S2ST system. Experiments show that the proposed system significantly outperforms existing systems in stress translation capability while maintaining competitive translation quality, and that the proposed metric correlates strongly with human subjective judgments.