Semi-structured content in HTML tables, lists, and infoboxes accounts for a substantial share of the factual data on the web, yet its formatting complicates downstream use, and reliably extracting structured information from it remains challenging. Existing methods either generalize poorly or are resource-intensive because they require LLM inference on every page. In this paper, we introduce SCRIBES (SCRIpt-Based Semi-Structured Content Extraction at Web-Scale), a novel reinforcement learning framework that leverages layout similarity across webpages within the same site as a reward signal. Instead of processing each page individually, SCRIBES generates reusable extraction scripts that can be applied to groups of structurally similar webpages. We further improve the approach by iteratively training on synthetic annotations from in-the-wild CommonCrawl data. Experiments show that our approach outperforms strong baselines by over 13% in script quality and boosts downstream question answering accuracy by more than 4% for GPT-4o, enabling scalable and resource-efficient web information extraction.
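To make the idea of a reusable extraction script concrete, the following is a minimal sketch, not the SCRIBES implementation. It assumes a script is a plain Python function from HTML to a set of (key, value) pairs, and that the reward is the mean F1 of that one script over a group of same-site pages with reference annotations. The names `extraction_script`, `f1`, and `group_reward`, the F1-based scoring, and the example pages are all illustrative assumptions; the abstract does not specify the actual reward.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of each <tr> as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extraction_script(html):
    """Hypothetical reusable script: two-cell table rows become (key, value) pairs."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return {(row[0], row[1]) for row in parser.rows if len(row) == 2}


def f1(predicted, reference):
    """Set-level F1 between predicted and reference pairs (assumed metric)."""
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def group_reward(script, pages, references):
    """Mean score of ONE script over a group of structurally similar pages."""
    scores = [f1(script(page), ref) for page, ref in zip(pages, references)]
    return sum(scores) / len(scores)


# Two pages sharing a layout, plus one page with a different layout.
page_a = "<table><tr><th>Name</th><td>Ada</td></tr><tr><th>Born</th><td>1815</td></tr></table>"
page_b = "<table><tr><th>Name</th><td>Alan</td></tr><tr><th>Born</th><td>1912</td></tr></table>"
page_c = "<ul><li>Name: Grace</li></ul>"

ref_a = {("Name", "Ada"), ("Born", "1815")}
ref_b = {("Name", "Alan"), ("Born", "1912")}
ref_c = {("Name", "Grace")}

same_layout_reward = group_reward(extraction_script, [page_a, page_b], [ref_a, ref_b])
mixed_layout_reward = group_reward(extraction_script, [page_a, page_c], [ref_a, ref_c])
```

In this toy setup the script scores 1.0 on the two pages that share a table layout and 0.5 on the mixed group, since it extracts nothing from the list-based page. A script that works on only one page of a group is penalized in the same way, which is the intuition behind scoring scripts at the group level rather than per page.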