Scale has become a main ingredient in obtaining strong machine learning models. As a result, understanding a model's scaling properties is key to effectively designing both the right training setup and future generations of architectures. In this work, we argue that scale and training research has been needlessly complex due to reliance on the cosine schedule, which prevents training across different lengths for the same model size. We investigate the training behavior of a direct alternative, a constant learning rate with a cooldown phase, and find that it scales predictably and reliably, similarly to cosine. Additionally, we show that stochastic weight averaging yields improved performance along the training trajectory, without additional training costs, across different scales. Importantly, with these findings we demonstrate that scaling experiments can be performed with significantly reduced compute and GPU hours by utilizing fewer but reusable training runs.
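For concreteness, the following is a minimal sketch of the two ingredients named above: a constant learning rate followed by a cooldown, and a running weight average collected along the trajectory. It assumes a Python/NumPy setting; the linear cooldown shape, the cooldown fraction, the base learning rate, and the uniform averaging rule are illustrative choices, not hyperparameters taken from this work.

```python
import numpy as np

def lr_at_step(step, total_steps, base_lr=3e-4, cooldown_frac=0.2):
    """Constant LR, then a linear cooldown to zero over the final fraction of training.

    All values here are illustrative, not the paper's settings.
    """
    cooldown_start = int(total_steps * (1 - cooldown_frac))
    if step < cooldown_start:
        return base_lr
    remaining = total_steps - cooldown_start
    return base_lr * (total_steps - step) / remaining


class WeightAverager:
    """Running (uniform) average of parameter snapshots taken during training."""

    def __init__(self):
        self.avg = None
        self.count = 0

    def update(self, params):
        # `params` maps parameter name -> NumPy array snapshot of the weights.
        self.count += 1
        if self.avg is None:
            self.avg = {k: v.copy() for k, v in params.items()}
        else:
            for k, v in params.items():
                self.avg[k] += (v - self.avg[k]) / self.count


if __name__ == "__main__":
    # Example: a 10k-step run; the LR stays constant, then cools down linearly.
    total = 10_000
    print(lr_at_step(0, total), lr_at_step(8_000, total), lr_at_step(9_000, total))

    averager = WeightAverager()
    for snapshot in ({"w": np.zeros(4)}, {"w": np.ones(4)}):
        averager.update(snapshot)
    print(averager.avg["w"])  # -> [0.5 0.5 0.5 0.5]
```

Because the learning rate is constant until the cooldown begins, a single long run can in principle be branched into cooldowns of different lengths, which is what makes training runs reusable across training durations in the sense described above.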