Large Language Models (LLMs) demonstrate remarkable capabilities in replicating human tasks and boosting productivity. However, their direct application to data extraction is limited: they prioritise fluency over factual accuracy and have a restricted ability to manipulate specific pieces of information. To overcome these limitations, this research combines the knowledge representation power of pre-trained LLMs with the targeted information access enabled by Retrieval-Augmented Generation (RAG), and investigates a general-purpose, accurate data scraping recipe for RAG models designed for language generation. To capture knowledge in a more modular and interpretable way, we augment pre-trained language models with a latent knowledge retriever, which allows the model to retrieve and attend over documents from a large corpus. We adopt the RAG model architecture and conduct an in-depth analysis of its capabilities on three tasks: (i) semantic classification of HTML elements, (ii) chunking HTML text for effective understanding, and (iii) comparing results across different LLMs and ranking algorithms. While previous work has developed dedicated architectures and training procedures for HTML understanding and extraction, we show that LLMs pre-trained on standard natural language, combined with effective chunking, searching, and ranking algorithms, can serve as an efficient data scraping tool for extracting complex data from unstructured text. Future research directions include addressing the challenges of provenance tracking and dynamic knowledge updates within the proposed RAG-based data extraction framework. By overcoming these limitations, this approach holds the potential to revolutionise data extraction from vast repositories of textual information.
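The chunk-then-rank retrieval step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it chunks raw text into fixed-size word windows and ranks chunks against a query by bag-of-words cosine similarity, standing in for whatever chunking policy and ranking algorithm the full system uses. All function names and parameters (`chunk_text`, `retrieve`, `max_words`, `k`) are illustrative assumptions.

```python
# Hedged sketch of a retrieve-then-generate pipeline: chunk the source text,
# rank chunks against a query, and return the top-k chunks as LLM context.
# The chunking policy and similarity measure here are simple stand-ins.
import math
import re
from collections import Counter


def chunk_text(text, max_words=50):
    """Split text into fixed-size word windows (one simple chunking policy)."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def _vectorise(text):
    """Bag-of-words term counts over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0


def retrieve(query, chunks, k=3):
    """Rank chunks by similarity to the query; the top-k become LLM context."""
    q = _vectorise(query)
    return sorted(chunks,
                  key=lambda c: _cosine(q, _vectorise(c)),
                  reverse=True)[:k]
```

In a full RAG system, the top-ranked chunks would be concatenated into the prompt of a generator LLM; the bag-of-words ranking shown here would typically be replaced by dense embeddings or a learned retriever.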