This report evaluates PDF-to-Markdown conversion using recent Vision-Language Models (VLMs) on challenging French documents. Document parsing is a critical step for Retrieval-Augmented Generation (RAG) pipelines, where transcription and layout errors propagate to downstream retrieval and grounding. Existing benchmarks often emphasize English or Chinese and can over-penalize benign formatting and linearization choices (e.g., line breaks, list segmentation, alternative table renderings) that are largely irrelevant for downstream use. We introduce a French-focused benchmark of difficult pages selected via model-disagreement sampling from a corpus of 60,000 documents, covering handwritten forms, complex layouts, dense tables, and graphics-rich pages. Evaluation is performed with unit-test-style checks that target concrete failure modes (text presence, reading order, and local table constraints) combined with category-specific normalization designed to discount presentation-only variance. Across 15 models, we observe substantially higher robustness for the strongest proprietary models on handwriting and forms, while several open-weight systems remain competitive on standard printed layouts.
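The unit-test-style checks mentioned above can be illustrated with a minimal sketch. This is not the benchmark's actual implementation; the function names and the specific normalization steps (Unicode NFKC, whitespace collapsing, case folding) are assumptions chosen to show how presentation-only variance can be discounted before asserting on content and ordering:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Discount presentation-only variance (line breaks, spacing,
    Unicode forms, case) before checking content. Illustrative only:
    the benchmark applies category-specific normalization."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text)  # collapse line breaks and runs of spaces
    return text.strip().lower()

def check_text_presence(markdown: str, expected: str) -> bool:
    """Unit-test-style check: an expected text span must survive conversion."""
    return normalize(expected) in normalize(markdown)

def check_reading_order(markdown: str, anchors: list[str]) -> bool:
    """Unit-test-style check: anchor spans must appear in the given order
    in the linearized Markdown output."""
    norm = normalize(markdown)
    pos = -1
    for anchor in anchors:
        idx = norm.find(normalize(anchor), pos + 1)
        if idx <= pos:  # anchor missing or out of order
            return False
        pos = idx
    return True
```

A check passes or fails on one concrete failure mode, so a model is never penalized for, say, rendering a list with different bullet markers, as long as the required text is present in the required order.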