As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, as shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency and Boltzmann sampling for data diversity, incurring only 4.7\% additional compute overhead. OPUS delivers consistent gains across diverse corpora, quality tiers, optimizers, and model scales. When pre-training GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-grade baselines and even full 200B-token training. Moreover, when combined with industrial-grade static filters, OPUS further improves pre-training efficiency, even on lower-quality data. Finally, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS surpasses full training on 3B tokens while using only 0.5B tokens, demonstrating significant data-efficiency gains in specialized domains.
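To make the selection rule concrete, the following is a minimal formal sketch; the notation ($g(x)$, $P_t$, $d_t$, $\tau$) is ours and is not fixed by the abstract. The utility of a candidate $x$ is the projection of its optimizer-shaped update onto the target direction, and selection draws from a Boltzmann distribution over utilities:
\[
u_t(x) \;=\; \big\langle \Delta\theta_t(x),\; d_t \big\rangle,
\qquad
\Delta\theta_t(x) \;=\; -\eta\, P_t\big(g(x)\big),
\qquad
p(x) \;\propto\; \exp\!\big(u_t(x)/\tau\big),
\]
where $g(x)$ is the raw gradient of candidate $x$, $P_t(\cdot)$ is the update transform induced by the optimizer at step $t$ (e.g., Adam's moment-normalized rescaling), $d_t$ is the target direction computed from the stable in-distribution proxy, $\eta$ is the learning rate, and $\tau$ is a temperature governing the diversity of Boltzmann sampling.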
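Below is a minimal Python sketch of the efficiency machinery described above, under stated assumptions: CountSketch compresses the high-dimensional update and target vectors so that the projection can be estimated as a sketched inner product, and Boltzmann sampling then selects a diverse subset. All function names, shapes, and the `opus_score` helper are illustrative, not the released OPUS implementation; Ghost-style per-example update extraction is abstracted away as precomputed vectors.

```python
import numpy as np

def count_sketch_params(d, k, seed=0):
    """Build a CountSketch as (hash buckets, random signs); O(d) memory."""
    rng = np.random.default_rng(seed)
    buckets = rng.integers(0, k, size=d)       # h: [d] -> [k]
    signs = rng.choice([-1.0, 1.0], size=d)    # s: [d] -> {+1, -1}
    return buckets, signs

def sketch(v, buckets, signs, k):
    """Project v in R^d down to R^k; inner products are preserved in expectation."""
    out = np.zeros(k)
    np.add.at(out, buckets, signs * v)         # scatter-add signed coordinates
    return out

def opus_score(candidate_update, target_direction, buckets, signs, k):
    """Hypothetical utility: projection of an optimizer-shaped update onto the
    proxy-derived target direction, estimated in the sketched space."""
    return sketch(candidate_update, buckets, signs, k) @ sketch(
        target_direction, buckets, signs, k)

def boltzmann_sample(scores, n_select, tau, seed=0):
    """Draw n_select candidates without replacement, p_i ∝ exp(score_i / tau)."""
    rng = np.random.default_rng(seed)
    z = (scores - scores.max()) / tau          # shift for numerical stability
    p = np.exp(z)
    p /= p.sum()
    return rng.choice(len(scores), size=n_select, replace=False, p=p)

if __name__ == "__main__":
    d, k = 100_000, 512                        # toy dimensions
    buckets, signs = count_sketch_params(d, k)
    rng = np.random.default_rng(1)
    updates = rng.standard_normal((64, d))     # 64 toy candidate updates
    target = rng.standard_normal(d)            # toy proxy-derived direction
    scores = np.array([opus_score(u, target, buckets, signs, k) for u in updates])
    chosen = boltzmann_sample(scores, n_select=16, tau=scores.std() + 1e-8)
```

The sketched inner product costs O(k) per pair after a one-time O(d) projection, which is how a CountSketch-style scheme can keep per-candidate scoring overhead small relative to the training step itself.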