This paper addresses Audio-Visual Speech Recognition (AVSR) under multimodal input corruption, where both the audio and the visual inputs are corrupted, a setting that previous research has not addressed well. Previous studies have focused on complementing corrupted audio inputs with clean visual inputs, assuming that clean visual inputs are available. In real life, however, clean visual inputs are not always accessible and can themselves be corrupted by occluded lip regions or noise. We therefore first show through analysis that previous AVSR models are in fact not robust to corruption of both input streams, audio and visual, compared to uni-modal models. Second, we design multimodal input corruption modeling to develop robust AVSR models. Finally, we propose a novel AVSR framework, the Audio-Visual Reliability Scoring module (AV-RelScore), that is robust to corrupted multimodal inputs. AV-RelScore can determine which input modal stream is reliable for the prediction and can exploit the more reliable streams when predicting. We evaluate the effectiveness of the proposed method with comprehensive experiments on the popular benchmark databases LRS2 and LRS3. We also show that the reliability scores obtained by AV-RelScore reflect the degree of corruption well and make the proposed model focus on the reliable multimodal representations.
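The two ideas in the abstract, corrupting both input streams and scoring each stream's reliability before fusion, can be illustrated with a minimal NumPy sketch. This is not the AV-RelScore architecture, whose details the abstract does not give. The function names, the white-noise and square-patch corruptions, and the linear-plus-sigmoid scoring with weights `w_a` and `w_v` are all assumptions made for illustration.

```python
import numpy as np


def add_noise_at_snr(audio, snr_db, rng):
    """Add white Gaussian noise so the result has roughly the requested SNR in dB.

    Assumption: white noise stands in for whatever audio corruption is used.
    """
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise


def occlude_frames(frames, top, left, size):
    """Zero out a square patch in every frame; frames has shape (T, H, W).

    Assumption: a fixed black square stands in for an occluded lip region.
    """
    out = frames.copy()  # leave the caller's array untouched
    out[:, top:top + size, left:left + size] = 0.0
    return out


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def reliability_weighted_fusion(audio_feat, visual_feat, w_a, w_v):
    """Scale each modality's per-frame features by a score in (0, 1), then concatenate.

    audio_feat, visual_feat: arrays of shape (T, D).
    w_a, w_v: hypothetical scoring weights of shape (D,); in a real model
    these would be learned, here they are plain inputs.
    """
    score_a = sigmoid(audio_feat @ w_a)    # shape (T,), one score per frame
    score_v = sigmoid(visual_feat @ w_v)   # shape (T,)
    fused = np.concatenate(
        [audio_feat * score_a[:, None], visual_feat * score_v[:, None]],
        axis=1,
    )
    return fused, score_a, score_v


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Corrupt both streams: a 1-second tone at 0 dB SNR and occluded video frames.
    audio = np.sin(2 * np.pi * 220.0 * np.arange(16000) / 16000.0)
    noisy_audio = add_noise_at_snr(audio, snr_db=0.0, rng=rng)
    video = rng.random((25, 88, 88))
    occluded_video = occlude_frames(video, top=30, left=30, size=40)

    # Random stand-ins for encoder outputs; a real system would encode the inputs above.
    audio_feat = rng.normal(size=(25, 16))
    visual_feat = rng.normal(size=(25, 16))
    fused, score_a, score_v = reliability_weighted_fusion(
        audio_feat, visual_feat, rng.normal(size=16), rng.normal(size=16)
    )
    print(noisy_audio.shape, occluded_video.shape, fused.shape)
```

In this sketch a score near 0 suppresses a frame's contribution from that modality and a score near 1 passes it through, which is one simple way to let a downstream predictor lean on the more reliable stream.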