Audio-visual speech recognition (AVSR) is a multimodal extension of automatic speech recognition (ASR) that uses video as a complement to audio. Considerable effort in AVSR has been directed at datasets centered on facial features such as lip reading, but these datasets often fall short in evaluating image comprehension in broader contexts. In this paper, we construct SlideAVSR, an AVSR dataset built from scientific paper explanation videos. SlideAVSR provides a new benchmark in which models transcribe speech utterances with the help of text on the slides shown in presentation recordings. Because the technical terms that are frequent in paper explanations are notoriously difficult to transcribe without reference text, SlideAVSR spotlights a new aspect of the AVSR problem. As a simple yet effective baseline, we propose DocWhisper, an AVSR model that can refer to textual information from slides, and we confirm its effectiveness on SlideAVSR.