We introduce MuAViC, a multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation, providing 1200 hours of audio-visual speech in 9 languages. It is fully transcribed and covers 6 English-to-X and 6 X-to-English translation directions. To the best of our knowledge, this is the first open benchmark for audio-visual speech-to-text translation and the largest open benchmark for multilingual audio-visual speech recognition. Our baseline results show that MuAViC is effective for building noise-robust speech recognition and translation models. We make the corpus available at https://github.com/facebookresearch/muavic.