What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to make similar strides. More specifically, conventional speech-to-speech translation relies on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. After filtering this corpus and combining it with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. In robustness tests, our system handles background noise and speaker variation in speech-to-text tasks better than the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication.
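To make concrete the cascaded design that the abstract contrasts with a unified model, the following is a minimal sketch of a three-stage speech-to-speech pipeline. Everything in it is hypothetical and for illustration only: `cascade_s2st`, the `Audio` alias, and the toy stages are assumptions, and none of them is part of SeamlessM4T or the seamless_communication library.

```python
from typing import Callable, List

# Stand-in for a waveform; a real system would use audio tensors (assumption).
Audio = List[float]


def cascade_s2st(
    asr: Callable[[Audio], str],
    mt: Callable[[str], str],
    tts: Callable[[str], Audio],
) -> Callable[[Audio], Audio]:
    """Compose three independent stages into one speech-to-speech function."""

    def translate(audio: Audio) -> Audio:
        transcript = asr(audio)       # stage 1: source speech -> source text
        translation = mt(transcript)  # stage 2: source text -> target text
        return tts(translation)       # stage 3: target text -> target speech

    return translate


# Toy stand-ins so the sketch runs without any model (all hypothetical).
def toy_asr(audio: Audio) -> str:
    return "hello" if audio else ""


def toy_mt(text: str) -> str:
    return {"hello": "bonjour"}.get(text, text)


def toy_tts(text: str) -> Audio:
    # Encode each character as a float to mimic producing audio samples.
    return [float(ord(ch)) for ch in text]


s2st = cascade_s2st(toy_asr, toy_mt, toy_tts)
waveform = s2st([0.1, 0.2])
```

In a cascade of this shape, each stage only sees the output of the previous one, so a transcription error in the first stage is passed on to translation and synthesis. A unified model, as described in the abstract, would replace the three separate calls with a single model that handles the supported tasks.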