Image-based multimodal automatic speech recognition (ASR) models aim to enhance speech recognition performance by incorporating audio-related images. However, some works suggest that introducing image information does not help improve ASR performance. In this paper, we propose a novel approach that effectively utilizes audio-related image information and build VHASR, a multimodal speech recognition system that uses vision as hotwords to strengthen the model's speech recognition capability. Our system adopts a dual-stream architecture, which first transcribes the text on the two streams separately and then combines the outputs. We evaluate the proposed model on four datasets: Flickr8k, ADE20k, COCO, and OpenImages. The experimental results show that VHASR can effectively utilize key information in images to enhance the model's speech recognition ability. Its performance not only surpasses unimodal ASR but also achieves SOTA among existing image-based multimodal ASR models.
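The dual-stream idea of transcribing on two streams and then combining the outputs can be illustrated with a minimal sketch. The fusion rule below (per-position confidence comparison) and all names are hypothetical illustrations, not the paper's actual method:

```python
# Hypothetical sketch of dual-stream output fusion: each stream emits
# position-aligned (token, confidence) pairs, and the fused transcript
# keeps the higher-confidence token at each position.

def fuse_streams(audio_stream, vision_stream):
    """Pick, per position, the token with the higher confidence."""
    fused = []
    for (a_tok, a_conf), (v_tok, v_conf) in zip(audio_stream, vision_stream):
        fused.append(a_tok if a_conf >= v_conf else v_tok)
    return fused

# Example: the vision-hotword stream corrects an acoustically ambiguous word.
audio = [("a", 0.9), ("cat", 0.4), ("runs", 0.8)]
vision = [("a", 0.7), ("dog", 0.6), ("runs", 0.5)]
print(fuse_streams(audio, vision))  # ['a', 'dog', 'runs']
```

In practice, multimodal ASR systems typically fuse at the representation or lattice level rather than over finished token lists; this toy version only conveys the "two transcriptions, one combined output" structure.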