Retrieval-Augmented Generation (RAG) systems depend critically on the quality of document preprocessing, yet no prior study has evaluated PDF processing frameworks by their impact on downstream question-answering accuracy. We address this gap through a systematic comparison of four open-source PDF-to-Markdown conversion frameworks (Docling, MinerU, Marker, and DeepSeek OCR) across 21 pipeline configurations that vary the conversion tool, cleaning transformations, splitting strategy, and metadata enrichment. Evaluation was performed using a 50-question benchmark over a corpus of 36 Portuguese administrative documents (1,706 pages, approximately 492K words), with LLM-as-judge scoring over 50 independent runs per configuration. Statistical significance was assessed via Wilcoxon signed-rank tests with Cohen's d effect sizes. Two baselines served as reference points: naïve PDFLoader (86.2%) and manually curated Markdown (91.3%). Docling with hierarchical splitting and image descriptions achieved the highest automated accuracy (94.1 ± 1.6%), surpassing even manual curation. A per-question-type analysis revealed that table-dependent questions drive the largest accuracy differences, with a 33-percentage-point gap between basic and hierarchical splitting. Metadata enrichment and hierarchy-aware chunking contributed more to accuracy than the choice of conversion framework alone. An exploratory GraphRAG implementation underperformed basic RAG (82% vs. 94.1%). These findings demonstrate that data preparation quality is the dominant factor in RAG system performance.
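For readers unfamiliar with the statistical procedure named above, the following is a minimal sketch of a paired comparison between two pipeline configurations. It is not the authors' analysis code. It assumes that scores are paired across runs, that the paired variant of Cohen's d (mean difference divided by the standard deviation of the differences) is intended, and that a normal approximation to the Wilcoxon signed-rank statistic is acceptable. The score lists and the names `config_a` and `config_b` are hypothetical illustrations, not results from the study.

```python
import math
from statistics import mean, stdev


def cohens_d_paired(a, b):
    """Paired Cohen's d: mean of the differences over their sample SD."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / stdev(diffs)


def wilcoxon_signed_rank(a, b):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Zero differences are dropped and tied absolute differences receive
    average ranks. Returns (W, p), where W is the smaller rank sum.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks within tie groups.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Normal approximation with tie correction for the variance.
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48)
    z = (w - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))
    return w, min(p, 1.0)


# Hypothetical per-run accuracies (percent) for two configurations.
config_a = [94, 93, 95, 92, 96, 94, 95, 93]
config_b = [91, 92, 90, 92, 89, 91, 93, 90]

w_stat, p_value = wilcoxon_signed_rank(config_a, config_b)
effect_size = cohens_d_paired(config_a, config_b)
print(f"W = {w_stat}, p = {p_value:.4f}, d = {effect_size:.2f}")
```

The normal approximation is coarse for samples this small. With 50 runs per configuration it becomes more reasonable, and a library routine such as `scipy.stats.wilcoxon` would typically be preferred in practice.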