Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts. Among existing methods, wrapper-based methods suffer from limited adaptability and scalability when faced with a new website, while language agents, empowered by large language models (LLMs), exhibit poor reusability in diverse web environments. In this work, we introduce the paradigm of generating web scrapers with LLMs and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently. AutoScraper leverages the hierarchical structure of HTML and the similarity across different web pages to generate web scrapers. In addition, we propose a new executability metric to better measure performance on web scraper generation tasks. We conduct comprehensive experiments with multiple LLMs and demonstrate the effectiveness of our framework. Resources for this paper can be found at \url{https://github.com/EZ-hwh/AutoScraper}.