Improvements in machine learning model performance tend to arise from competition and application. For deployment, we consider prescriptive scaling laws: given a pre-training compute budget, what downstream accuracy is attainable with contemporary post-training practice, and how stable is that mapping as the field evolves? Using large-scale observational evaluations of 5k existing and 2k newly evaluated model checkpoints, spanning 2022-2026 and six benchmarks, we estimate capability boundaries (high conditional quantiles of benchmark scores as a function of log pre-training FLOPs) via smoothed quantile regression with a monotone, saturating sigmoid parameterization. We validate temporal reliability by fitting on earlier model generations and evaluating on later releases: on four of six tasks, the out-of-distribution coverage error remains below 2%, while math reasoning exhibits a consistently advancing boundary over time. For instance, at a budget of 10^24 FLOPs, the estimated attainable accuracies are 0.83 on IFEval and 0.54 on MATH Lvl 5. We then extend our approach to analyze task-dependent saturation and to probe contamination-related shifts on math reasoning tasks. Finally, we introduce a balanced I-optimal sampling algorithm that recovers near-full-data frontiers using roughly 20% of the parameter-count-weighted evaluation budget (as low as 5% on some tasks) while maintaining comparable calibration. Together, our work releases Proteus-2k, the latest model performance evaluation dataset, and introduces a practical methodology for translating compute budgets into reliable performance expectations and for monitoring when capability boundaries shift over time.
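To make the estimation step concrete, the following is a minimal sketch of fitting a high-quantile capability boundary with a monotone, saturating sigmoid in log pre-training FLOPs, and then measuring coverage error on held-out points. It is an illustration under stated assumptions, not the paper's implementation: the three-parameter form (`hi`, `mid`, `slope`), the softplus-based smoothing of the pinball loss, the crude grid search standing in for a proper optimizer, and the synthetic `TRAIN` / `HELD_OUT` data are all hypothetical choices made here.

```python
import math
from itertools import product


def sigmoid_boundary(log_flops, hi, mid, slope):
    """Monotone (slope > 0), saturating boundary that rises from 0 toward `hi`."""
    return hi / (1.0 + math.exp(-slope * (log_flops - mid)))


def smoothed_pinball(residual, tau, h=0.01):
    """Smooth surrogate for the pinball loss; it approaches the exact loss as h -> 0.

    Assumed smoothing: tau * u + h * softplus(-u / h), in a numerically stable form.
    """
    softplus = max(-residual, 0.0) + h * math.log1p(math.exp(-abs(residual) / h))
    return tau * residual + softplus


def mean_loss(params, data, tau):
    """Average smoothed pinball loss of the boundary `params` on (log_flops, score) pairs."""
    hi, mid, slope = params
    total = sum(
        smoothed_pinball(score - sigmoid_boundary(x, hi, mid, slope), tau)
        for x, score in data
    )
    return total / len(data)


def fit_boundary(data, tau, grid):
    """Pick the grid point with the lowest loss (a stand-in for a real optimizer)."""
    return min(grid, key=lambda params: mean_loss(params, data, tau))


def coverage(params, data):
    """Fraction of observed scores that lie at or below the fitted boundary."""
    hi, mid, slope = params
    below = sum(1 for x, score in data if score <= sigmoid_boundary(x, hi, mid, slope))
    return below / len(data)


def coverage_error(params, data, tau):
    """Absolute gap between empirical coverage and the target quantile level."""
    return abs(coverage(params, data) - tau)


def synthetic_scores(log_flops_values):
    """Hypothetical data: scores spread evenly below a known sigmoid ceiling."""
    fractions = [0.05 + 0.1 * k for k in range(10)]
    return [
        (x, u * sigmoid_boundary(x, 0.9, 23.0, 1.0))
        for x in log_flops_values
        for u in fractions
    ]


# Hypothetical split: "earlier" models for fitting, "later" models for evaluation.
TRAIN = synthetic_scores([20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0])
HELD_OUT = synthetic_scores([20.5, 21.5, 22.5, 23.5, 24.5, 25.5])

# Candidate (hi, mid, slope) triples; every slope is positive, so each
# candidate boundary is monotone in log FLOPs.
GRID = list(product([0.6, 0.7, 0.8, 0.9, 1.0], [22.0, 23.0, 24.0], [0.5, 1.0, 2.0]))

if __name__ == "__main__":
    tau = 0.9
    best = fit_boundary(TRAIN, tau, GRID)
    print("fitted (hi, mid, slope):", best)
    print("boundary at 10^24 FLOPs:", sigmoid_boundary(24.0, *best))
    print("held-out coverage error:", coverage_error(best, HELD_OUT, tau))
```

In practice the grid search would be replaced by a gradient-based optimizer over continuous parameters, and the held-out set would consist of model releases that postdate the training cutoff, so that coverage error measures how well the boundary holds up over time.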