We introduce MuAViC, a multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation that provides 1200 hours of audio-visual speech in 9 languages. It is fully transcribed and covers 6 English-to-X and 6 X-to-English translation directions. To the best of our knowledge, this is the first open benchmark for audio-visual speech-to-text translation and the largest open benchmark for multilingual audio-visual speech recognition. Our baseline results show that MuAViC is effective for building noise-robust speech recognition and translation models. We make the corpus available at https://github.com/facebookresearch/muavic.