Prior work has shown that additional multimodal information can improve automatic speech recognition (ASR) performance. However, many of these studies use only visual cues from human lip motion. Context-dependent visual and linguistic cues can also improve ASR performance in many scenarios. In this paper, we first propose a multimodal ASR model (ViLaS) that can integrate visual and linguistic cues, either simultaneously or separately, to help recognize the input speech, and we introduce a training strategy that can improve performance in modality-incomplete test scenarios. We then create a multimodal ASR dataset (VSDial) with visual and linguistic cues to explore the effects of integrating vision and language. Finally, we report empirical results on the public Flickr8K dataset and our self-constructed VSDial dataset, investigate cross-modal fusion schemes, and analyze fine-grained cross-modal alignment on VSDial.