High-quality main-content extraction from web pages is a critical prerequisite for constructing large-scale training corpora. While traditional heuristic extractors are efficient, they lack the semantic reasoning required to handle the structural heterogeneity of the modern web. Conversely, well-pretrained generative Large Language Models (LLMs) offer superior document comprehension but are hindered by prohibitive computational costs, limited context windows, and hallucination risks when applied at web scale. We present \textbf{Dripper}, a lightweight framework that resolves these bottlenecks through four contributions: (1) We reformulate extraction as a \textbf{constrained sequence labeling} task using Small Language Models (SLMs). This paradigm eliminates generative hallucinations and achieves exceptional efficiency, reaching a throughput of 3.08 pages per second on a single A100 GPU. (2) We construct \textbf{WebMainBench}, a rigorous benchmark of 7,809 human-annotated pages covering 5,434 unique domains and multiple languages. Evaluations show that our Dripper-0.6B model \textbf{outperforms} heuristics such as Trafilatura and rivals massive models such as DeepSeek-V3.2 (685B), GPT-5, and Gemini-2.5-Pro, offering an optimal efficiency-accuracy trade-off. (3) We demonstrate infrastructural value by \textbf{pre-training a 1B model} on a Dripper-curated corpus (63B tokens). This model significantly outperforms baselines on downstream tasks, confirming the critical role of extraction quality and the effectiveness of our framework. (4) We \textbf{open-source} the Dripper-0.6B weights and codebase to facilitate the construction of high-quality datasets.
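To make the first contribution concrete, the following is an illustrative sketch (not the paper's implementation) of extraction cast as constrained sequence labeling: a page is segmented into blocks, a model assigns each block a label drawn from a closed set, and the main content is the concatenation of MAIN-labeled blocks. The scoring function here is a hypothetical stand-in for the SLM's per-label logits; the key point is that the decoder can only emit labels, so generative hallucination is structurally impossible.

```python
# Illustrative sketch of constrained sequence labeling for main-content
# extraction. The block scorer below is a hypothetical heuristic standing
# in for a small language model's per-label logits.
from dataclasses import dataclass

LABELS = ("MAIN", "BOILERPLATE")  # closed label set: decoding is constrained to these


@dataclass
class Block:
    """A candidate content unit segmented from the page's DOM."""
    text: str
    tag: str


def score_block(block: Block) -> dict:
    """Stand-in for SLM logits: one score per label (hypothetical heuristic)."""
    main = float(len(block.text.split()))  # longer prose leans toward main content
    noise = 5.0 if block.tag in {"nav", "footer", "aside"} else 0.0
    return {"MAIN": main, "BOILERPLATE": noise + 3.0}


def label_page(blocks: list) -> list:
    """Constrained decoding: per-block argmax over the fixed label set."""
    return [max(LABELS, key=score_block(b).get) for b in blocks]


blocks = [
    Block("Home | About | Contact", "nav"),
    Block(
        "Dripper reformulates extraction as constrained sequence labeling, "
        "eliminating generative hallucinations at web scale.",
        "p",
    ),
    Block("(c) 2024 Example Corp", "footer"),
]
labels = label_page(blocks)
main_text = "\n".join(b.text for b, lab in zip(blocks, labels) if lab == "MAIN")
```

Because the output space is the label set rather than free-form text, the extracted content is always a verbatim subset of the input page, which is what makes the approach safe at corpus scale.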