With the rapid development of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) has become a predominant method for professional knowledge-based question answering. Major foundation model companies now offer Embedding and Chat APIs, and frameworks like LangChain have already integrated the RAG process. It may seem that the key models and steps in RAG have been resolved, which raises the question: are professional knowledge QA systems now approaching perfection? This article finds that current mainstream methods depend on the premise of access to high-quality text corpora. However, since professional documents are mainly stored as PDFs, the low accuracy of PDF parsing significantly degrades the effectiveness of professional knowledge-based QA. We conducted an empirical RAG experiment on hundreds of questions drawn from real-world professional documents. The results show that ChatDOC, a RAG system equipped with a panoptic and pinpoint PDF parser, retrieves more accurate and complete segments and thus produces better answers. ChatDOC is superior to the baseline on nearly 47% of questions, ties on 38%, and falls short on only 15%. This suggests that enhanced PDF structure recognition may revolutionize RAG.