Revolutionizing Retrieval-Augmented Generation with Enhanced PDF Structure Recognition

With the rapid development of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) has become a predominant method for professional knowledge-based question answering. Major foundation model companies now offer Embedding and Chat APIs, and frameworks such as LangChain already integrate the RAG pipeline. It may seem that the key models and steps in RAG are solved, which raises the question: are professional knowledge QA systems now approaching perfection? This article finds that current mainstream methods rest on the premise of access to high-quality text corpora. However, because professional documents are mainly stored as PDFs, low PDF parsing accuracy significantly degrades the effectiveness of professional knowledge-based QA. We conducted an empirical RAG experiment on hundreds of questions drawn from real-world professional documents. The results show that ChatDOC, a RAG system equipped with a panoptic and pinpoint PDF parser, retrieves more accurate and complete segments and therefore produces better answers. ChatDOC is superior to the baseline on nearly 47% of questions, ties on 38%, and falls short on only 15%. This suggests that enhanced PDF structure recognition may revolutionize RAG.
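To make the dependence on parsing quality concrete, here is a minimal sketch of the retrieval step in a RAG pipeline. It is not ChatDOC's implementation or the paper's experimental setup: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and both chunk lists are hypothetical parser outputs invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Hypothetical output of a structure-aware parser:
# the table row stays together with its header.
structured_chunks = [
    "Table 1. Revenue by year. Year: 2022, Revenue: 5.1 million",
    "Headquarters: Oslo. Founded: 1999. Employees: 240.",
]

# Hypothetical output of a structure-blind parser:
# the header and the cell values land in different chunks.
flat_chunks = [
    "Table 1. Revenue by year. Year Revenue",
    "2022 5.1 million Headquarters: Oslo. Founded: 1999. Employees: 240.",
]

query = "What was the revenue in 2022?"
best_structured = retrieve(query, structured_chunks)[0]
best_flat = retrieve(query, flat_chunks)[0]

print("structured:", best_structured)
print("flat:      ", best_flat)
```

In this toy setup the top chunk from the structured parse contains the figure the question asks for, while the top chunk from the flat parse is the orphaned table header without the value, so any downstream answer generation would start from incomplete context.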

