AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation

Web automation is a significant technique that accomplishes complicated web tasks by automating common web actions, improving operational efficiency and reducing the need for manual intervention. Traditional methods, such as wrappers, suffer from limited adaptability and scalability when faced with a new website. Generative agents empowered by large language models (LLMs), on the other hand, exhibit poor performance and reusability in open-world scenarios. In this work, we introduce a crawler generation task for vertical information web pages, together with a paradigm that combines LLMs with crawlers and helps crawlers handle diverse and changing web environments more efficiently. We propose AutoCrawler, a two-stage framework that leverages the hierarchical structure of HTML for progressive understanding. Through top-down and step-back operations, AutoCrawler learns from erroneous actions and continuously prunes the HTML for better action generation. We conduct comprehensive experiments with multiple LLMs and demonstrate the effectiveness of our framework. Resources for this paper are available at https://github.com/EZ-hwh/AutoCrawler.
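As a rough illustration of what a step-back operation over the HTML hierarchy might look like, the following minimal sketch starts from a node picked by an erroneous action and climbs toward the root until the enclosing subtree contains the target value, then keeps only that subtree. This is not the authors' implementation: the sample page, the function names `subtree_text` and `step_back`, and the use of Python's standard `xml.etree.ElementTree` on a well-formed snippet are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (real-world HTML would need an HTML parser).
PAGE = """
<html><body>
  <div id="nav"><a>Home</a><a>Cars</a></div>
  <div id="main">
    <div class="item"><h1>Model X</h1><span class="price">$25,000</span></div>
    <div class="ads"><span>Sponsored</span></div>
  </div>
</body></html>
"""


def subtree_text(node):
    """Return all non-empty text fragments under node, joined by spaces."""
    return " ".join(t.strip() for t in node.itertext() if t.strip())


def step_back(root, node, target):
    """Climb from node toward the root until the subtree text contains target.

    Returns the smallest enclosing subtree that contains target,
    or None if even the whole document does not contain it.
    """
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    while node is not None and target not in subtree_text(node):
        node = parents.get(node)
    return node


root = ET.fromstring(PAGE)

# Assumed erroneous action: a candidate path that lands on the title, not the price.
wrong = root.find(".//h1")

# Step back to a subtree that does contain the value, pruning navigation and ads.
pruned = step_back(root, wrong, "$25,000")
print(ET.tostring(pruned, encoding="unicode"))
```

In this sketch the pruned subtree is the `div` with class `item`, which a later top-down pass could search again with far less irrelevant markup in view.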


