One of the first pre-processing steps for constructing web-scale LLM pretraining datasets is extracting text from HTML. Despite the immense diversity of web content, existing open-source datasets predominantly apply a single fixed extractor to all webpages. In this work, we investigate whether this practice leads to suboptimal coverage and utilization of Internet data. We first show that while different extractors may yield similar model performance on standard language understanding tasks, the pages surviving a fixed filtering pipeline can differ substantially. This suggests a simple intervention: by taking a union over different extractors, we can increase the token yield of DCLM-Baseline by up to 71% while maintaining benchmark performance. We further show that for structured content such as tables and code blocks, extractor choice can significantly impact downstream task performance, with differences of up to 10 percentage points (p.p.) on WikiTQ and 3 p.p. on HumanEval.