It has been shown that learning audiovisual features can improve speech recognition performance over audio-only features, especially for noisy speech. However, in many common applications the visual features are partially or entirely missing, e.g., when the speaker moves off screen. Multi-modal models need to be robust: missing video frames should not degrade the performance of an audiovisual model below that of a single-modality audio-only model. While there have been many attempts at building robust models, there is little consensus on how robustness should be evaluated. To address this, we introduce a framework that allows claims about robustness to be evaluated in a precise and testable way. We also conduct a systematic empirical study of the robustness of common audiovisual speech recognition architectures across a range of acoustic noise conditions and test suites. Finally, we show that an architecture-agnostic solution based on cascades can consistently achieve robustness to missing video, even in settings where existing robustness techniques such as dropout fall short.