Look before Transcription: End-to-End SlideASR with Visually-Anchored Policy Optimization

Automatic speech recognition (ASR) systems often struggle with domain-specific terminology, especially in specialized settings such as academic lectures. To address this, we define the SlideASR task, which leverages the rich visual information from presentation slides to improve transcription accuracy. Existing pipeline methods for this task tend to be complex and underperform. Although omni-modal large language models (OLLMs) provide a promising end-to-end framework, they frequently fail in practice by degenerating into simple optical character recognition (OCR) systems. To overcome this, we propose Visually-Anchored Policy Optimization (VAPO), a novel post-training method designed to control the model's reasoning process. Drawing on the Chain-of-Thought reasoning paradigm, VAPO enforces a structured "Look before Transcription" procedure using a <think><answer> format. Specifically, the model first performs OCR on the slide content within the think step, then generates the transcription by referencing this recognized visual information in the answer step. This reasoning process is optimized via reinforcement learning with four distinct rewards targeting format compliance, OCR accuracy, ASR quality, and visual anchoring consistency. To support further research, we construct SlideASR-Bench, a new entity-rich benchmark consisting of a synthetic dataset for training and testing, and a challenging real-world set for evaluation. Extensive experiments demonstrate that VAPO significantly improves recognition of domain-specific terms, establishing an effective end-to-end paradigm for SlideASR.
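
As a rough illustration of the format-compliance signal mentioned above, the following Python sketch checks whether a model response follows the "Look before Transcription" layout and splits it into its OCR and transcription parts. The abstract names only a <think><answer> format, so the closing tags, the regular expression, the function names `parse_response` and `format_reward`, and the binary 0/1 scoring are all assumptions made for this example, not the authors' reward definition.

```python
import re

# Assumed layout: exactly one <think>...</think> block (slide OCR)
# followed by exactly one <answer>...</answer> block (transcription).
_TAGS = ("<think>", "</think>", "<answer>", "</answer>")
_PATTERN = re.compile(
    r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL
)


def parse_response(text):
    """Return (ocr_text, transcript) if the response is well formed, else None."""
    # Reject responses that repeat or omit any tag.
    if any(text.count(tag) != 1 for tag in _TAGS):
        return None
    match = _PATTERN.fullmatch(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def format_reward(text):
    """Hypothetical binary format reward: 1.0 if the layout is respected."""
    return 1.0 if parse_response(text) is not None else 0.0
```

In a setup like this, the parsed OCR text and transcript could then be passed to separate scorers for OCR accuracy, ASR quality, and visual anchoring consistency; how those three rewards are computed and combined is not specified in the abstract.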
