As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, as shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency and Boltzmann sampling for data diversity, incurring only 4.7\% additional compute overhead. OPUS delivers consistent gains across diverse corpora, quality tiers, optimizers, and model scales. When pre-training GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-grade baselines and even full 200B-token training. Moreover, when combined with industrial-grade static filters, OPUS further improves pre-training efficiency, even on lower-quality data. Finally, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS surpasses full training on 3B tokens while using only 0.5B tokens, demonstrating significant data-efficiency gains in specialized domains.
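To make the selection rule concrete, the following is a minimal formal sketch; the notation ($g(x)$, $P_t$, $d_t$, $\tau$) is ours and is not fixed by the abstract. The utility of a candidate $x$ is the projection of its optimizer-shaped update onto the target direction, and selection draws from a Boltzmann distribution over utilities:
\[
u_t(x) \;=\; \big\langle \Delta\theta_t(x),\; d_t \big\rangle,
\qquad
\Delta\theta_t(x) \;=\; -\eta\, P_t\big(g(x)\big),
\qquad
p(x) \;\propto\; \exp\!\big(u_t(x)/\tau\big),
\]
where $g(x)$ is the raw gradient of candidate $x$, $P_t(\cdot)$ is the update transform induced by the optimizer at step $t$ (e.g., Adam's moment-normalized rescaling), $d_t$ is the target direction computed from the stable in-distribution proxy, $\eta$ is the learning rate, and $\tau$ is a temperature governing the diversity of Boltzmann sampling.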
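Below is a minimal Python sketch of the efficiency machinery described above, under stated assumptions: CountSketch compresses the high-dimensional update and target vectors so that the projection can be estimated as a sketched inner product, and Boltzmann sampling then selects a diverse subset. All function names, shapes, and the `opus_score` helper are illustrative, not the released OPUS implementation; Ghost-style per-example update extraction is abstracted away as precomputed vectors.

```python
import numpy as np

def count_sketch_params(d, k, seed=0):
    """Build a CountSketch as (hash buckets, random signs); O(d) memory."""
    rng = np.random.default_rng(seed)
    buckets = rng.integers(0, k, size=d)       # h: [d] -> [k]
    signs = rng.choice([-1.0, 1.0], size=d)    # s: [d] -> {+1, -1}
    return buckets, signs

def sketch(v, buckets, signs, k):
    """Project v in R^d down to R^k; inner products are preserved in expectation."""
    out = np.zeros(k)
    np.add.at(out, buckets, signs * v)         # scatter-add signed coordinates
    return out

def opus_score(candidate_update, target_direction, buckets, signs, k):
    """Hypothetical utility: projection of an optimizer-shaped update onto the
    proxy-derived target direction, estimated in the sketched space."""
    return sketch(candidate_update, buckets, signs, k) @ sketch(
        target_direction, buckets, signs, k)

def boltzmann_sample(scores, n_select, tau, seed=0):
    """Draw n_select candidates without replacement, p_i ∝ exp(score_i / tau)."""
    rng = np.random.default_rng(seed)
    z = (scores - scores.max()) / tau          # shift for numerical stability
    p = np.exp(z)
    p /= p.sum()
    return rng.choice(len(scores), size=n_select, replace=False, p=p)

if __name__ == "__main__":
    d, k = 100_000, 512                        # toy dimensions
    buckets, signs = count_sketch_params(d, k)
    rng = np.random.default_rng(1)
    updates = rng.standard_normal((64, d))     # 64 toy candidate updates
    target = rng.standard_normal(d)            # toy proxy-derived direction
    scores = np.array([opus_score(u, target, buckets, signs, k) for u in updates])
    chosen = boltzmann_sample(scores, n_select=16, tau=scores.std() + 1e-8)
```

The sketched inner product costs O(k) per pair after a one-time O(d) projection, which is how a CountSketch-style scheme can keep per-candidate scoring overhead small relative to the training step itself.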