Extracting structured data from the web is often a trade-off between the brittleness of manual heuristics and the prohibitive cost of Large Language Models. We introduce AXE (Adaptive X-Path Extractor), a pipeline that rethinks this process by treating the HTML DOM as a tree to be pruned rather than a wall of text to be read. AXE uses a specialized pruning mechanism to strip away boilerplate and irrelevant nodes, leaving a distilled, high-density context that allows a tiny 0.6B-parameter LLM to generate precise, structured outputs. To keep the model honest, we implement Grounded XPath Resolution (GXR), which ensures every extraction is traceable to a source node. Despite its small footprint, AXE achieves state-of-the-art zero-shot performance with an F1 score of 88.1% on the SWDE dataset, outperforming several much larger, fully trained alternatives. By releasing our specialized adaptors, we aim to provide a practical, cost-effective path for large-scale web information extraction.
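The prune-then-ground idea can be illustrated with a minimal sketch. This is not the AXE implementation: the tag blocklist below is an assumed stand-in for the specialized pruning mechanism, the LLM step is replaced by hard-coded candidate values, and the helper names (`prune`, `xpath_index`, `ground`, `resolve`) are hypothetical. It also assumes the page is well-formed enough to parse as XML, which real HTML often is not.

```python
import copy
import xml.etree.ElementTree as ET

# Assumption: a fixed blocklist stands in for a learned pruning step.
BOILERPLATE_TAGS = {"script", "style", "nav", "footer", "header", "aside"}

PAGE = """<html><body>
<nav><a>Home</a><a>Deals</a></nav>
<div><script>track()</script><h1>Acme Laptop 14</h1><span></span><span>$999</span></div>
<footer><p>Copyright Acme</p></footer>
</body></html>"""


def prune(elem):
    """Drop boilerplate subtrees and empty leaves, in place."""
    for child in list(elem):
        if child.tag in BOILERPLATE_TAGS:
            elem.remove(child)
            continue
        prune(child)
        if not (child.text or "").strip() and len(child) == 0:
            elem.remove(child)


def xpath_index(root):
    """Map every element of the source tree to a positional XPath."""
    paths = {root: "/" + root.tag}

    def walk(elem):
        counts = {}
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            paths[child] = f"{paths[elem]}/{child.tag}[{counts[child.tag]}]"
            walk(child)

    walk(root)
    return paths


def ground(value, source_root, paths):
    """Return the XPath of the first source node whose text equals value."""
    for elem in source_root.iter():
        if (elem.text or "").strip() == value:
            return paths[elem]
    return None  # no supporting node: treat the value as ungrounded


def resolve(source_root, xpath):
    """Follow a path from xpath_index back to its node (non-root paths only)."""
    relative = "." + xpath[len("/" + source_root.tag):]
    return source_root.find(relative)


source = ET.fromstring(PAGE)
paths = xpath_index(source)

# Prune a copy so the original tree stays available for grounding.
pruned = copy.deepcopy(source)
prune(pruned)
context = ET.tostring(pruned, encoding="unicode")  # compact model input

# Stand-ins for model outputs: one supported value, one unsupported.
xpath = ground("$999", source, paths)
rejected = ground("$1299", source, paths)
print(context)
print(xpath, rejected)
```

In this sketch the XPaths are indexed on the unpruned tree, so a grounded value points at a node in the original page rather than in the reduced context. A value with no matching source node returns `None` and can be discarded.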