Medical document OCR is challenging: layouts are complex, terminology is domain-specific, annotations are noisy, and the task requires strict field-level exact matching. Existing OCR systems and general-purpose vision-language models (VLMs) often fail to parse such documents reliably. We propose MeDocVL, a post-trained vision-language model for query-driven medical document parsing. Our framework combines Training-driven Label Refinement, which constructs high-quality supervision from noisy annotations, with a Noise-aware Hybrid Post-training strategy that integrates reinforcement learning and supervised fine-tuning for robust and precise extraction. Experiments on medical invoice benchmarks show that MeDocVL consistently outperforms conventional OCR systems and strong VLM baselines, achieving state-of-the-art performance under noisy supervision.