Web automation is an important technique that accomplishes complex web tasks by automating common web actions, improving operational efficiency and reducing the need for manual intervention. Traditional methods, such as wrappers, suffer from limited adaptability and scalability when faced with a new website. Generative agents empowered by large language models (LLMs), on the other hand, exhibit poor performance and reusability in open-world scenarios. In this work, we introduce a crawler generation task for vertical information web pages, together with a paradigm that combines LLMs with crawlers to help crawlers handle diverse and changing web environments more efficiently. We propose AutoCrawler, a two-stage framework that leverages the hierarchical structure of HTML for progressive understanding. Through top-down and step-back operations, AutoCrawler learns from erroneous actions and continuously prunes the HTML for better action generation. We conduct comprehensive experiments with multiple LLMs and demonstrate the effectiveness of our framework. Resources for this paper can be found at \url{https://github.com/EZ-hwh/AutoCrawler}.