Retrieval-Augmented Generation (RAG) has proven effective for retrieving information from diverse documents. However, it struggles with complex table queries, especially over PDF documents that contain intricate tabular structures. This research introduces an approach to improve the accuracy of complex table queries in RAG-based systems. Our method stores the PDFs in the retrieval database and extracts their tabular content separately. The extracted tables undergo context enrichment, in which each header is concatenated with its corresponding values. To capture the content of the enriched data, we use a fine-tuned version of the Llama-2-chat language model for summarisation within the RAG architecture. We further augment the tabular data with contextual meaning using the ChatGPT 3.5 API through a one-shot prompt. The enriched data is then fed into the retrieval database alongside the other PDFs. Our approach aims to significantly improve the precision of complex table queries, offering a promising solution to a longstanding challenge in information retrieval.
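The context-enrichment and one-shot prompting steps can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `enrich_rows` and `build_one_shot_prompt`, the `"Header: value"` separator format, the prompt wording, and the sample table are all assumptions made for illustration. The sketch only builds the strings; sending the prompt to a language model and indexing the result are left out.

```python
from typing import Sequence


def enrich_rows(headers: Sequence[str], rows: Sequence[Sequence[str]]) -> list[str]:
    """Concatenate each header with its corresponding cell value, one string per row.

    The "Header: value; Header: value" layout is an assumed format.
    """
    enriched = []
    for row in rows:
        if len(row) != len(headers):
            raise ValueError("row length does not match header count")
        pairs = [f"{header}: {value}" for header, value in zip(headers, row)]
        enriched.append("; ".join(pairs))
    return enriched


def build_one_shot_prompt(example_row: str, example_sentence: str, target_row: str) -> str:
    """Build a one-shot prompt: one worked example, then the row to be described.

    The instruction wording is hypothetical, not taken from the paper.
    """
    return (
        "Rewrite the table row as one plain-English sentence.\n\n"
        f"Row: {example_row}\n"
        f"Sentence: {example_sentence}\n\n"
        f"Row: {target_row}\n"
        "Sentence:"
    )


if __name__ == "__main__":
    # Made-up table used purely for illustration.
    headers = ["Region", "Year", "Revenue (USD m)"]
    rows = [["EMEA", "2022", "41.3"], ["APAC", "2022", "37.8"]]

    enriched = enrich_rows(headers, rows)
    prompt = build_one_shot_prompt(
        example_row=enriched[0],
        example_sentence="In 2022, the EMEA region reported revenue of 41.3 million USD.",
        target_row=enriched[1],
    )
    print(prompt)
```

Pairing every value with its header keeps each row self-describing, so a retrieved chunk remains interpretable even when it is separated from the table's header row.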