The abundance and heterogeneity of web content necessitate automated information extraction, and generating scrapers that can be reused across similar web pages offers an effective route to scalable data extraction. In this work, we propose Co-Scraper, a two-stage framework designed to handle the hierarchical complexity of long HTML documents. By combining a query-aware DOM pruning mechanism with the induction of stable extraction strategies, Co-Scraper transforms web content into executable programmatic wrappers using a fine-tuned Qwen3-8B model. On the SWDE test set, Co-Scraper achieves state-of-the-art performance, with an F1 score of 94.78% and a reuse success rate of 90.39%. The framework improves both the accuracy and the resilience of data extraction, providing an efficient approach to web data acquisition tasks.
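To make the idea of query-aware DOM pruning concrete, the following is a minimal sketch and not Co-Scraper's actual mechanism, which the abstract does not specify. It assumes a simple token-overlap relevance test, a well-formed (XML-parseable) page, and hypothetical names (`tokens`, `prune`, `PAGE`) chosen purely for illustration. Subtrees that share no token with the query are dropped, so that a much shorter document would reach the model.

```python
import re
import xml.etree.ElementTree as ET


def tokens(text):
    """Return the set of lowercase alphanumeric tokens in a string."""
    return set(re.findall(r"[a-z0-9]+", (text or "").lower()))


def prune(element, query_tokens):
    """Remove child subtrees that share no token with the query.

    Relevance here is plain token overlap on an element's own text and
    attribute values (an assumption made for this sketch). Returns True
    if the element or any kept descendant is relevant, so that ancestors
    of a match are preserved.
    """
    own = tokens(element.text) | tokens(" ".join(element.attrib.values()))
    relevant = bool(own & query_tokens)
    for child in list(element):  # iterate over a copy while removing
        if prune(child, query_tokens):
            relevant = True
        else:
            element.remove(child)
    return relevant


# Hypothetical, well-formed page used only for illustration.
PAGE = """
<html><body>
  <div class="nav"><a href="/home">Home</a><a href="/about">About</a></div>
  <div class="product">
    <h1>Acme Laptop</h1>
    <span class="price">Price: $999</span>
  </div>
  <div class="footer">Copyright Acme</div>
</body></html>
"""

root = ET.fromstring(PAGE)
prune(root, tokens("price"))
pruned_html = ET.tostring(root, encoding="unicode")
print(pruned_html)
```

In this toy run only the path from the root to the price element survives; the navigation bar, the heading, and the footer are discarded. Real-world HTML is rarely well-formed XML, so a practical pipeline would need a tolerant HTML parser and a stronger relevance signal than token overlap.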