Automatic speech recognition (ASR) has in recent years reached a level of accuracy that even surpasses humans at transcribing speech to text. Nevertheless, all current ASR approaches remain vulnerable to ambient noise. To reduce this weakness, audio-visual speech recognition (AVSR) approaches additionally use visual information from lip movements for transcription. This additional modality increases the computational cost of training models from scratch. We propose an approach that builds on a pre-trained ASR model and extends it with an adaptive upstream module that fuses audio and visual information. Because the transformer structure does not need to be trained from scratch, our approach requires only a fraction of the computational resources of traditional AVSR models. Compared to current state-of-the-art (SOTA) systems such as AV-HuBERT, our approach achieves an average improvement of 8.3% in word error rate across different model sizes, noise categories, and a broad range of signal-to-noise ratios (SNR). The approach allows up to 21% smaller models and requires only a fraction of the computational resources for training and inference compared to common AVSR approaches.
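For readers unfamiliar with the reported metric, the following is a minimal sketch of how word error rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration. The abstract does not describe the evaluation code, any text normalization, or whether the 8.3% improvement is relative or absolute.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (hypothetical name); tokenization is plain
    whitespace splitting, which is an assumption, not the paper's setup.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, this value can exceed 1.0 when the hypothesis is much longer than the reference.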