Prescriptive Scaling Reveals the Evolution of Language Model Capabilities

Improvements in machine learning model performance tend to arise from competition and application. For deployment, we consider prescriptive scaling laws: given a pre-training compute budget, what downstream accuracy is attainable with contemporary post-training practice, and how stable is that mapping as the field evolves? Using large-scale observational evaluations of 5k existing and 2k newly evaluated model checkpoints spanning 2022-2026 across six benchmarks, we estimate capability boundaries (high conditional quantiles of benchmark scores as a function of log pre-training FLOPs) via smoothed quantile regression with a monotone, saturating sigmoid parameterization. We validate temporal reliability by fitting on earlier model generations and evaluating on later releases: on four of the six tasks, the out-of-distribution coverage error remains below 2%, while math reasoning exhibits a consistently advancing boundary over time. For instance, at a budget of 10^24 FLOPs, the estimated attainable accuracies are 0.83 on IFEval and 0.54 on MATH Lvl 5. We then extend our approach to analyze task-dependent saturation and to probe contamination-related shifts on math reasoning tasks. Finally, we introduce a balanced I-optimal sampling algorithm that recovers near-full-data frontiers using roughly 20% of the parameter-count-weighted evaluation budget (as low as 5% on some tasks) while maintaining comparable calibration. Together, our work releases Proteus-2k, the latest model performance evaluation dataset, and introduces a practical methodology for translating compute budgets into reliable performance expectations and for monitoring when capability boundaries shift over time.
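
To make the estimation step concrete, the following is a minimal sketch of smoothed quantile regression with a monotone, saturating sigmoid, followed by a held-out coverage check. Everything specific here is an assumption for illustration and is not taken from the paper: the three-parameter logistic form, the softplus smoothing of the pinball loss with bandwidth `h`, the quantile level `tau = 0.95`, the simple coordinate search used as an optimizer, and the synthetic data.

```python
import math
import random


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def boundary(log_flops, a, b, c):
    """Hypothetical capability boundary in log10 pre-training FLOPs.

    ceiling = sigmoid(a) lies in [0, 1] (saturating), slope = exp(b) > 0
    (monotone increasing), and c is the midpoint on the log-FLOPs axis.
    """
    return _sigmoid(a) * _sigmoid(math.exp(b) * (log_flops - c))


def smoothed_pinball(residual, tau, h):
    """Pinball loss tau*u + max(0, -u) with the max replaced by a softplus
    of width h (one common smoothing; assumed, not the paper's exact choice)."""
    z = -residual / h
    softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return tau * residual + h * softplus


def mean_loss(params, xs, ys, tau, h):
    total = sum(smoothed_pinball(y - boundary(x, *params), tau, h)
                for x, y in zip(xs, ys))
    return total / len(xs)


def fit_boundary(xs, ys, tau=0.95, h=0.02, sweeps=200):
    """Coordinate search over (a, b, c); only accepts strict improvements."""
    params = [0.0, 0.0, sum(xs) / len(xs)]
    best = mean_loss(params, xs, ys, tau, h)
    step = 1.0
    for _ in range(sweeps):
        improved = False
        for i in range(3):
            for delta in (step, -step):
                trial = list(params)
                trial[i] += delta
                loss = mean_loss(trial, xs, ys, tau, h)
                if loss < best:
                    params, best, improved = trial, loss, True
        if not improved:
            step /= 2.0
            if step < 1e-4:
                break
    return params, best


def coverage(params, xs, ys):
    """Fraction of models scoring at or below the fitted boundary."""
    return sum(y <= boundary(x, *params) for x, y in zip(xs, ys)) / len(xs)


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.uniform(20.0, 25.0) for _ in range(300)]
    # Synthetic scores: a hidden frontier scaled by a random per-model factor.
    ys = [boundary(x, 1.5, 0.2, 22.5) * rng.uniform(0.3, 1.0) for x in xs]
    # Treat the first 200 points as "earlier" models, the rest as "later".
    params, _ = fit_boundary(xs[:200], ys[:200], tau=0.95)
    gap = abs(coverage(params, xs[200:], ys[200:]) - 0.95)
    print(f"attainable at 1e24 FLOPs: {boundary(24.0, *params):.2f}")
    print(f"held-out coverage error: {gap:.3f}")
```

Evaluating `boundary(24.0, *params)` corresponds to the prescriptive question in the abstract (what accuracy is attainable at 10^24 FLOPs), and the gap between held-out coverage and `tau` corresponds to the coverage error used for temporal validation. A real analysis would likely use a gradient-based optimizer and the actual evaluation data rather than this toy search.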
