Audio-visual speech recognition (AVSR) is a multimodal extension of automatic speech recognition (ASR) that uses video as a complement to audio. In AVSR, considerable effort has been directed at datasets of facial features such as lip reading, but these datasets often fall short in evaluating image comprehension capabilities in broader contexts. In this paper, we construct SlideAVSR, an AVSR dataset built from scientific paper explanation videos. SlideAVSR provides a new benchmark in which models transcribe speech utterances with the help of text on the slides in presentation recordings. Because the technical terms that appear frequently in paper explanations are notoriously difficult to transcribe without reference text, our SlideAVSR dataset spotlights a new aspect of AVSR problems. As a simple yet effective baseline, we propose DocWhisper, an AVSR model that can refer to textual information from the slides, and confirm its effectiveness on SlideAVSR.
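The abstract does not detail how DocWhisper consumes slide text. As a minimal sketch of the general idea, the snippet below assumes slide text is extracted with OCR and supplied to Whisper as a decoding prompt; the function names, model size, and the use of pytesseract are illustrative assumptions, not the authors' implementation.

```python
# Sketch (not the authors' implementation): bias Whisper's decoder toward
# slide vocabulary by passing OCR output as an initial prompt.
import whisper            # pip install openai-whisper
import pytesseract        # pip install pytesseract (requires Tesseract binary)
from PIL import Image


def transcribe_with_slide_text(audio_path: str, slide_image_path: str) -> str:
    # OCR the slide frame; the extracted words act as reference text.
    slide_text = pytesseract.image_to_string(Image.open(slide_image_path))

    # Deduplicate words while preserving order, then truncate: Whisper's
    # prompt window is limited (~224 tokens), so long slide dumps are cut.
    keywords = " ".join(dict.fromkeys(slide_text.split()))[:800]

    model = whisper.load_model("small")
    # initial_prompt conditions the decoder on the slide vocabulary,
    # nudging it toward in-domain technical terminology.
    result = model.transcribe(audio_path, initial_prompt=keywords)
    return result["text"]


if __name__ == "__main__":
    # Hypothetical file paths for illustration.
    print(transcribe_with_slide_text("talk.wav", "slide.png"))
```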