Web scraping has historically required technical expertise in HTML parsing, session management, and authentication circumvention, limiting large-scale data extraction to skilled developers. We argue that large language models (LLMs) have democratized web scraping, enabling low-skill users to execute sophisticated operations through simple natural-language prompts. While extensive benchmarks evaluate these tools under optimal expert conditions, we show that, without extensive manual effort, current LLM-based workflows allow novice users to scrape complex websites that would otherwise be inaccessible to them. We systematically benchmark what everyday users can achieve with off-the-shelf LLM tools across 35 sites spanning five security tiers, including authentication, anti-bot, and CAPTCHA controls. We devise and evaluate two distinct workflows: (a) LLM-assisted scripting, where users prompt LLMs to generate traditional scraping code but retain manual execution control, and (b) end-to-end LLM agents, which autonomously navigate and extract data through integrated tool use. Our results demonstrate that end-to-end agents have made complex scraping accessible, requiring as little as a single prompt with minimal refinement (fewer than five changes) to complete workflows. We also highlight scenarios where LLM-assisted scripting may be simpler and faster, particularly for static sites. In light of these findings, we provide simple procedures that let novices use these workflows, and we gauge what adversaries could achieve with them.