Large Language Models (LLMs) demonstrate remarkable capabilities in replicating human tasks and boosting productivity. However, their direct application to data extraction is limited: they prioritise fluency over factual accuracy and have a restricted ability to manipulate specific pieces of information. To overcome these limitations, this research combines the knowledge representation power of pre-trained LLMs with the targeted information access enabled by Retrieval-Augmented Generation (RAG), and investigates a general-purpose, accurate data scraping recipe for RAG models designed for language generation. To capture knowledge in a more modular and interpretable way, we augment pre-trained language models with a latent knowledge retriever, which allows the model to retrieve and attend over documents from a large corpus. We adopt the RAG model architecture and conduct an in-depth analysis of its capabilities on three tasks: (i) semantic classification of HTML elements, (ii) chunking HTML text for effective understanding, and (iii) comparing results across different LLMs and ranking algorithms. While previous work has developed dedicated architectures and training procedures for HTML understanding and extraction, we show that LLMs pre-trained on standard natural language, combined with effective chunking, searching, and ranking algorithms, can serve as an efficient data scraping tool for extracting complex data from unstructured text. Future research directions include addressing the challenges of provenance tracking and dynamic knowledge updates within the proposed RAG-based data extraction framework. By overcoming these limitations, this approach holds the potential to revolutionise data extraction from vast repositories of textual information.
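The chunk-then-rank retrieval step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it chunks raw text into fixed-size word windows and ranks chunks against a query by bag-of-words cosine similarity, standing in for whatever chunking policy and ranking algorithm the full system uses. All function names and parameters (`chunk_text`, `retrieve`, `max_words`, `k`) are illustrative assumptions.

```python
# Hedged sketch of a retrieve-then-generate pipeline: chunk the source text,
# rank chunks against a query, and return the top-k chunks as LLM context.
# The chunking policy and similarity measure here are simple stand-ins.
import math
import re
from collections import Counter


def chunk_text(text, max_words=50):
    """Split text into fixed-size word windows (one simple chunking policy)."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def _vectorise(text):
    """Bag-of-words term counts over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0


def retrieve(query, chunks, k=3):
    """Rank chunks by similarity to the query; the top-k become LLM context."""
    q = _vectorise(query)
    return sorted(chunks,
                  key=lambda c: _cosine(q, _vectorise(c)),
                  reverse=True)[:k]
```

In a full RAG system, the top-ranked chunks would be concatenated into the prompt of a generator LLM; the bag-of-words ranking shown here would typically be replaced by dense embeddings or a learned retriever.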