Multi-modal automatic speech recognition (ASR) techniques aim to leverage additional modalities to improve the performance of speech recognition systems. While existing approaches primarily focus on video or contextual information, the use of supplementary textual information has been overlooked. Recognizing the abundance of online conference videos with slides, which provide rich domain-specific information in the form of text and images, we release SlideSpeech, a large-scale audio-visual corpus enriched with slides. The corpus contains 1,705 videos totaling more than 1,000 hours, of which 473 hours are high-quality transcribed speech. The corpus also contains a large number of real-time synchronized slides. In this work, we present the pipeline for constructing the corpus and propose baseline methods for utilizing textual information from the visual slide context. By applying keyword extraction and contextual ASR methods in the benchmark system, we demonstrate the potential of improving speech recognition performance by incorporating textual information from the accompanying video slides.
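To make the idea of keyword extraction from slide text concrete, the following is a minimal sketch and not the method used in the benchmark system. It assumes the slide text has already been obtained (for example by OCR) and simply ranks non-stopword tokens by frequency to form a candidate bias list for a contextual ASR model. The function name `extract_keywords`, the `STOPWORDS` set, and the sample slide string are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for illustration);
# a real system would use a much larger list or a corpus-based frequency filter.
STOPWORDS = {
    "the", "a", "an", "of", "and", "or", "to", "in", "for", "on",
    "with", "is", "are", "we", "this", "that", "by", "as", "from",
}


def extract_keywords(slide_text, top_k=5):
    """Return up to top_k candidate bias words from slide text.

    Tokens are lowercased alphabetic words; stopwords and tokens of
    two characters or fewer are dropped before ranking by frequency.
    """
    tokens = re.findall(r"[a-z][a-z\-]+", slide_text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    # Sort by descending frequency, then alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


# Hypothetical slide text, standing in for OCR output.
slide = "Conformer encoder for ASR. The Conformer combines convolution and attention."
print(extract_keywords(slide, top_k=3))
```

The resulting list could then be passed to a contextual ASR model as biasing words. How the biasing is performed depends on the model and is outside the scope of this sketch.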