The web contains large-scale, diverse, and abundant information that satisfies human information-seeking needs. Through meticulous data collection, preprocessing, and curation, webpages can serve as a fundamental data resource for language model pretraining. However, as webpages evolve and grow more intricate, rule-based and feature-based web scrapers are becoming increasingly inadequate. This paper presents a simple, fast, and effective Neural web Scraper (NeuScraper) that extracts primary, clean text content from webpages. Experimental results show that NeuScraper surpasses the baseline scrapers, achieving an improvement of more than 20% and demonstrating its potential to extract higher-quality data for language model pretraining. All of the code is available at https://github.com/OpenMatch/NeuScraper.