Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge, reducing hallucinations and incorporating up-to-date information without retraining. As an essential component of RAG, external knowledge bases are commonly built by extracting structured data from unstructured PDF documents using Optical Character Recognition (OCR). However, given the imperfect predictions of OCR and the inherently non-uniform representation of structured data, knowledge bases inevitably contain various OCR noises. In this paper, we introduce OHRBench, the first benchmark for understanding the cascading impact of OCR on RAG systems. OHRBench includes 350 carefully selected unstructured PDF documents from six real-world RAG application domains, along with Q&As derived from multimodal elements in the documents, challenging existing OCR solutions used for RAG. To better understand OCR's impact on RAG systems, we identify two primary types of OCR noise, Semantic Noise and Formatting Noise, and apply perturbations to generate a set of structured data with varying degrees of each. Using OHRBench, we first conduct a comprehensive evaluation of current OCR solutions and reveal that none is competent for constructing high-quality knowledge bases for RAG systems. We then systematically evaluate the impact of these two noise types and demonstrate the vulnerability of RAG systems. Furthermore, we discuss the potential of employing Vision-Language Models (VLMs) without OCR in RAG systems. Code: https://github.com/opendatalab/OHR-Bench