From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

José Guilherme Marques dos Santos, Ricardo Yang, Rui Humberto Pereira, Alexandre Sousa, Brígida Mónica Faria, Henrique Lopes Cardoso, José Duarte, José Luís Reis, Luís Paulo Reis, Pedro Pimenta, José Paulo Marques dos Santos

From arXiv: 27 pages, 3 figures, 7 tables

Retrieval-Augmented Generation (RAG) systems depend critically on the quality of document preprocessing, yet no prior study has evaluated PDF processing frameworks by their impact on downstream question-answering accuracy. We address this gap through a systematic comparison of four open-source PDF-to-Markdown conversion frameworks (Docling, MinerU, Marker, and DeepSeek OCR) across 21 pipeline configurations that vary the conversion tool, cleaning transformations, splitting strategy, and metadata enrichment. Evaluation used a 50-question benchmark over a corpus of 36 Portuguese administrative documents (1706 pages, ~492K words), with LLM-as-judge scoring over 50 independent runs per configuration. Statistical significance was assessed via Wilcoxon signed-rank tests with Cohen's d effect sizes. Two baselines bounded the results: naïve PDFLoader (86.2%) and manually curated Markdown (91.3%). Docling with hierarchical splitting and image descriptions achieved the highest automated accuracy (94.1 ± 1.6%), surpassing even manual curation. A per-question-type analysis revealed that table-dependent questions drive the largest accuracy differences, with a 33-percentage-point gap between basic and hierarchical splitting. Metadata enrichment and hierarchy-aware chunking contributed more to accuracy than the choice of conversion framework alone. An exploratory GraphRAG implementation underperformed basic RAG (82% vs. 94.1%). These findings indicate that data preparation quality is the dominant factor in RAG system performance.
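
The abstract does not spell out how hierarchical splitting or metadata enrichment is implemented, so the following is only a minimal sketch of the general idea, not the authors' pipeline. It assumes the converted Markdown uses ATX headings (`#` to `######`), splits the text at each heading, and attaches the chain of enclosing headings to every chunk. The names `Chunk`, `split_by_headings`, and `render` are hypothetical and chosen for illustration.

```python
import re
from dataclasses import dataclass

# Assumption: headings are ATX-style ("#", "##", ...), as most converters emit.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # code-fence marker; heading-like lines inside fences are ignored


@dataclass
class Chunk:
    path: list[str]  # heading breadcrumb, outermost heading first
    text: str        # body text under the innermost heading


def split_by_headings(markdown: str) -> list[Chunk]:
    """Split Markdown at headings, keeping each chunk's heading ancestry."""
    chunks: list[Chunk] = []
    stack: list[tuple[int, str]] = []  # (level, title) of open headings
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk([title for _, title in stack], text))
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # close the chunk that belonged to the previous heading
            level = len(match.group(1))
            # Drop headings at the same or deeper level before opening this one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks


def render(chunk: Chunk) -> str:
    """Prefix the breadcrumb so the embedded text carries its section context."""
    return " > ".join(chunk.path) + "\n\n" + chunk.text
```

Because a table stays in the same chunk as the section heading it sits under, a retriever sees both the table rows and the context that says what the table is about. This is one plausible reason why table-dependent questions would benefit most from hierarchy-aware chunking, though the abstract itself does not state the mechanism.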
