The performance of modern language models depends critically on the composition of their pretraining data. Yet existing data selection methods rely on auxiliary classifiers for document scoring or mixture optimization, which adds computational overhead and a dependence on labeled data. We propose WebGraphMix, a lightweight data selection framework that computes structural centrality scores over the Common Crawl host-level web graph and uses them to vary the proportion of central versus peripheral documents in the pretraining mixture. We hypothesize that central hosts expose models to reusable abstractions, while peripheral hosts encode specialized, long-tail knowledge. WebGraphMix computes centrality scores efficiently at web scale and requires no model training, labeled data, or downstream supervision. We integrate WebGraphMix into the DataComp-LM pipeline and train models at the 400M and 1B parameter scales on 8B and 28B tokens, respectively, evaluating on 23 tasks ranging from factual knowledge to symbolic reasoning. Our experiments show that central and peripheral web regions encode complementary capabilities. A mixture combining both at a 1:1 ratio achieves 41.4% on average, compared to 39.8% for uniform sampling. Combining structural scores with document-level quality classifier scores further improves performance to 43.8%. These findings demonstrate that web graph topology is a meaningful axis for pretraining data curation, capturing information that is largely orthogonal to existing content-based approaches.
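To make the selection idea concrete, the following is a minimal sketch of scoring hosts by centrality and sampling a mixture with a fixed central-to-peripheral ratio. It is an illustration under stated assumptions, not the WebGraphMix implementation: the abstract does not name the centrality measure, so PageRank is used here as a stand-in, and the toy graph, the threshold rule in `split_hosts`, and all function names are hypothetical.

```python
import random
from collections import defaultdict


def pagerank(edges, damping=0.85, iters=50):
    """Power-iteration PageRank over a directed host graph of (src, dst) pairs.

    Assumption: PageRank stands in for the unspecified centrality measure.
    """
    nodes = sorted({host for edge in edges for host in edge})
    out = defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)
    n = len(nodes)
    rank = {host: 1.0 / n for host in nodes}
    for _ in range(iters):
        # Rank held by hosts with no out-links is spread uniformly.
        dangling = sum(rank[h] for h in nodes if not out[h])
        nxt = {h: (1 - damping) / n + damping * dangling / n for h in nodes}
        for src in nodes:
            for dst in out[src]:
                nxt[dst] += damping * rank[src] / len(out[src])
        rank = nxt
    return rank


def split_hosts(scores, central_fraction=0.5):
    """Hypothetical rule: the top-scoring fraction of hosts counts as central."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    cut = max(1, int(len(ordered) * central_fraction))
    return set(ordered[:cut]), set(ordered[cut:])


def sample_mixture(docs, central, n_docs, central_share=0.5, seed=0):
    """Sample n_docs documents (with replacement) at the given central share."""
    rng = random.Random(seed)
    central_docs = [d for d in docs if d["host"] in central]
    peripheral_docs = [d for d in docs if d["host"] not in central]
    n_central = round(n_docs * central_share)
    picked = rng.choices(central_docs, k=n_central)
    picked += rng.choices(peripheral_docs, k=n_docs - n_central)
    rng.shuffle(picked)
    return picked


# Toy host graph (made-up hosts): three hosts link to hub.com.
edges = [
    ("a.org", "hub.com"),
    ("b.net", "hub.com"),
    ("c.io", "hub.com"),
    ("hub.com", "a.org"),
    ("d.dev", "b.net"),
]
scores = pagerank(edges)
central, peripheral = split_hosts(scores, central_fraction=0.4)

# Three placeholder documents per host.
docs = [{"host": h, "text": f"doc {i} from {h}"} for h in scores for i in range(3)]

# A 1:1 mixture of central and peripheral documents.
mixture = sample_mixture(docs, central, n_docs=10, central_share=0.5)
```

Changing `central_share` shifts the mixture toward central or peripheral hosts, and a value of 0.5 corresponds to the 1:1 ratio reported above. At web scale the centrality computation would need a distributed or out-of-core graph library rather than this in-memory loop.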