This paper describes the NPU-MSXF system for the IWSLT 2023 speech-to-speech translation (S2ST) task, which aims to translate multi-source English speech into Chinese speech. The system is a cascade of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) modules. Much of our effort goes into handling the challenging multi-source input. Specifically, to improve robustness to multi-source speech input, we adopt several data augmentation strategies and apply ROVER-based score fusion to the outputs of multiple ASR models. To better handle noisy ASR transcripts, we introduce a three-stage fine-tuning strategy to improve translation accuracy. Finally, we build a TTS model with high naturalness and sound quality. It uses a two-stage framework in which network bottleneck features serve as a robust intermediate representation for disentangling speaker timbre from linguistic content. Within this framework, a pre-trained speaker embedding is used as a condition to transfer the speaker timbre of the source English speech to the translated Chinese speech. Experimental results show that our system achieves high translation accuracy, speech naturalness, sound quality, and speaker similarity, and that it is robust to multi-source data.
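To make the ROVER-based fusion step concrete, the following is a minimal sketch of confidence-weighted voting over ASR hypotheses. It is an illustration under stated assumptions, not the system's implementation: the hypotheses are assumed to be already aligned slot by slot (full ROVER first builds this alignment with dynamic programming), the score is a plain sum of per-word confidences, and the names `rover_vote` and `<eps>` (the gap marker) are hypothetical.

```python
from collections import defaultdict


def rover_vote(aligned_hyps, null_token="<eps>"):
    """Pick, for each aligned slot, the word with the highest summed confidence.

    aligned_hyps: list of hypotheses, each a list of (word, confidence) pairs.
    Assumption: all hypotheses are pre-aligned to the same length, with
    null_token marking a gap (the alignment step of ROVER is omitted here).
    """
    if not aligned_hyps:
        return []
    length = len(aligned_hyps[0])
    if any(len(hyp) != length for hyp in aligned_hyps):
        raise ValueError("hypotheses must be pre-aligned to equal length")

    output = []
    for slot in zip(*aligned_hyps):
        # Accumulate the confidence each candidate word receives in this slot.
        scores = defaultdict(float)
        for word, confidence in slot:
            scores[word] += confidence
        best = max(scores, key=scores.get)
        # A winning gap means no word is emitted for this slot.
        if best != null_token:
            output.append(best)
    return output


# Toy outputs from three hypothetical ASR models.
hyps = [
    [("the", 0.9), ("cat", 0.6), ("sat", 0.8), ("<eps>", 0.5)],
    [("the", 0.8), ("cap", 0.5), ("sat", 0.7), ("down", 0.6)],
    [("the", 0.9), ("cat", 0.7), ("sad", 0.4), ("down", 0.7)],
]
fused = rover_vote(hyps)
```

In this toy case each model makes a different error, and the vote recovers a transcript that none of the three produced on its own, which is the motivation for fusing multiple ASR outputs before translation.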