Audio-Visual Speech Recognition (AVSR) uses lip-based video to improve performance in noise. Since video is harder to obtain than audio, the video training data of AVSR models is usually limited to a few thousand hours. In contrast, speech models such as Whisper are trained on hundreds of thousands of hours of data and thus learn a better speech-to-text decoder. This large gap in training data motivates us to adapt Whisper to handle video inputs. Inspired by Flamingo, which injects visual features into language models, we propose Whisper-Flamingo, which integrates visual features into the Whisper speech recognition and translation model via gated cross attention. Our models achieve state-of-the-art ASR WER (0.68%) and AVSR WER (0.76%) on LRS3. In noisy conditions, audio-visual Whisper-Flamingo outperforms audio-only Whisper on English speech recognition and on En-X translation into 6 languages. Moreover, Whisper-Flamingo is versatile and performs all of these tasks with a single set of parameters, whereas prior methods are trained separately for each language.
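To make the gated cross-attention mechanism concrete, below is a minimal PyTorch sketch of a Flamingo-style gated cross-attention layer, assuming it is inserted into Whisper's decoder so that decoder states attend over video features. The class name, dimensions, and layer layout are illustrative assumptions, not the paper's exact implementation; the key idea shown is the tanh gate initialized to zero, so the layer starts as an identity map and the pretrained Whisper behavior is preserved at the beginning of fine-tuning.

```python
import torch
import torch.nn as nn


class GatedCrossAttention(nn.Module):
    """Sketch of a Flamingo-style gated cross-attention layer (illustrative).

    Decoder states `x` attend over visual features `visual`. Both gate
    parameters start at zero, so tanh(gate) = 0 and the layer initially
    passes `x` through unchanged, preserving the pretrained model.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm_attn = nn.LayerNorm(dim)
        self.norm_ffn = nn.LayerNorm(dim)
        # Zero-initialized gates: tanh(0) = 0, so the block is a no-op at init.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffn_gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # x: (batch, text_len, dim) decoder states
        # visual: (batch, video_len, dim) video features
        h, _ = self.attn(self.norm_attn(x), visual, visual)
        x = x + torch.tanh(self.attn_gate) * h
        x = x + torch.tanh(self.ffn_gate) * self.ffn(self.norm_ffn(x))
        return x
```

Because the gates are zero at initialization, inserting these layers into a frozen or pretrained decoder does not change its outputs until the gates are learned, which is what allows the model to retain Whisper's audio-only capability while gradually incorporating video.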