Predicting model performance at larger scales enables the design of training strategies and architectures tailored to specific performance targets. Empirical scaling law research identifies functional forms that support this prediction task. These forms describe the relationship between loss and compute through a loss-compute frontier defined by learning curves. Because the approach is empirical, its computational burden is substantial, which makes strategic resource allocation essential, yet this question remains surprisingly underexplored. In this work, we address this gap by exploring the suitability of Successive Halving (SH), both alone and combined with parametric and non-parametric surrogate models. Beyond enabling a more systematic allocation of a given compute budget, our findings show that SH paired with surrogate models yields a set of learning curves that includes one with a lower loss-compute value than any obtained by naive uniform allocation or by SH alone. Our experiments demonstrate mean relative improvements of up to 2.84% and 5.47% on real-world and synthetic learning curve datasets, respectively. This strategic resource allocation lets us obtain accurate scaling laws at significantly reduced computational cost, saving up to 98.7% compared with the traditional exhaustive approach.
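To make the allocation idea concrete, the following is a minimal sketch of plain Successive Halving over a set of candidate learning curves. It is not the authors' implementation and omits the surrogate models entirely. The power-law curve form, the function names `make_curve` and `successive_halving`, and the parameters `min_steps`, `eta`, and `max_steps` are all assumptions introduced for illustration.

```python
def make_curve(a, b, c):
    """Hypothetical power-law learning curve: loss(t) = a * t**(-b) + c."""
    return lambda t: a * t ** (-b) + c


def successive_halving(curves, min_steps=1, eta=2, max_steps=64):
    """Run Successive Halving and return (best config index, total steps spent).

    curves: list of callables mapping a training step count to a loss.
    At each rung, survivors are trained up to `steps`, the best 1/eta
    of them are kept, and the per-config budget is multiplied by eta.
    """
    survivors = list(range(len(curves)))
    trained = [0] * len(curves)  # steps already paid for, per config
    spent = 0
    steps = min_steps
    while True:
        # Pay only for the additional steps each survivor still needs.
        for i in survivors:
            spent += steps - trained[i]
            trained[i] = steps
        if len(survivors) == 1 or steps >= max_steps:
            break
        # Rank by loss at the current budget and keep the top 1/eta.
        survivors.sort(key=lambda i: curves[i](steps))
        survivors = survivors[: max(1, len(survivors) // eta)]
        steps = min(steps * eta, max_steps)
    best = min(survivors, key=lambda i: curves[i](trained[i]))
    return best, spent


# Eight synthetic configs that differ only in their irreducible loss c.
curves = [make_curve(1.0, 0.5, 0.1 * i) for i in range(8)]
best, spent = successive_halving(curves)

# Uniform allocation to the same final depth (8 steps) would cost 8 * 8 steps.
uniform_cost = len(curves) * 8
print(best, spent, uniform_cost)
```

In this toy setting the curves never cross, so early rankings are always reliable. With real learning curves that do cross, early pruning can discard a configuration that would have won later, which is the kind of failure mode a surrogate model of the curve is meant to mitigate.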