Recent advancements in multimodal large language models (MLLMs) have made significant progress in integrating information across various modalities, yet real-world applications in educational and scientific domains remain challenging. This paper introduces the Multimodal Scientific ASR (MS-ASR) task, which focuses on transcribing scientific conference videos by leveraging visual information from slides to enhance the accuracy of technical terminology. Recognizing that traditional metrics like WER fall short in assessing performance accurately, we propose severity-aware WER (SWER), which considers the content type and severity of ASR errors. We further propose the Scientific Vision Augmented ASR (SciVASR) framework as a baseline method, enabling MLLMs to improve transcript quality through post-editing. Evaluations of state-of-the-art MLLMs, including GPT-4o, show a 45% improvement over speech-only baselines, highlighting the importance of multimodal information integration.
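The abstract does not spell out the SWER formula, but the idea of weighting ASR errors by content type and severity can be sketched minimally. The sketch below assumes errors have already been identified via alignment against the reference and labeled with a content type (e.g. a technical term vs. a common word); the weight values and the `severity_aware_wer` name are illustrative assumptions, not the paper's definition.

```python
def severity_aware_wer(errors, num_ref_words, severity):
    """Illustrative severity-weighted WER sketch (not the paper's exact metric).

    errors: list of (op, content_type) tuples for each aligned ASR error,
            e.g. ("sub", "term") for a substitution on a technical term.
    num_ref_words: number of words in the reference transcript.
    severity: mapping from content_type to a weight; unknown types default to 1.0,
              which reduces to standard WER when all weights are 1.0.
    """
    weighted_errors = sum(severity.get(ctype, 1.0) for _, ctype in errors)
    return weighted_errors / num_ref_words


# Usage: a mis-recognized technical term is penalized more than a common word.
weights = {"term": 2.0, "common": 1.0}  # assumed weights, for illustration only
score = severity_aware_wer(
    [("sub", "term"), ("del", "common")], num_ref_words=10, severity=weights
)
# With these assumed weights: (2.0 + 1.0) / 10 = 0.3
```

With uniform weights the measure collapses to ordinary WER, which is the property that makes a severity-aware variant a natural generalization.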