The web contains large-scale, diverse, and abundant information that satisfies people's information-seeking needs. With careful data collection, preprocessing, and curation, webpages can serve as a fundamental data resource for language model pretraining. However, as webpages continue to evolve and grow more intricate, rule-based and feature-based web scrapers are becoming increasingly inadequate. This paper presents a simple, fast, and effective Neural web Scraper (NeuScraper) that extracts primary and clean text content from webpages. Experimental results show that NeuScraper surpasses the baseline scrapers with an improvement of more than 20%, demonstrating its potential to extract higher-quality data for language model pretraining. All of the code is available at https://github.com/OpenMatch/NeuScraper.