Benchmarking Neural Network Training Algorithms

George E. Dahl, Frank Schneider, Zachary Nado, Naman Agarwal, Chandramouli Shama Sastry, Philipp Hennig, Sourabh Medapati, Runa Eschenhagen, Priya Kasimbeg, Daniel Suo, Juhan Bae, Justin Gilmer, Abel L. Peirson, Bilal Khan, Rohan Anil, Mike Rabbat, Shankar Krishnan, Daniel Snider, Ehsan Amid, Kongtao Chen, Chris J. Maddison, Rakshith Vasudev, Michal Badura, Ankush Garg, Peter Mattson

From arXiv; 102 pages, 8 figures, 41 tables

Training algorithms, broadly construed, are an essential part of every deep learning pipeline. Training algorithm improvements that speed up training across a wide variety of workloads (e.g., better update rules, tuning protocols, learning rate schedules, or data selection schemes) could save time, save computational resources, and lead to better, more accurate models. Unfortunately, as a community, we are currently unable to reliably identify training algorithm improvements, or even determine the state-of-the-art training algorithm. In this work, using concrete experiments, we argue that real progress in speeding up training requires new benchmarks that resolve three basic challenges faced by empirical comparisons of training algorithms: (1) how to decide when training is complete and precisely measure training time, (2) how to handle the sensitivity of measurements to exact workload details, and (3) how to fairly compare algorithms that require hyperparameter tuning. To address these challenges, we introduce a new, competitive, time-to-result benchmark using multiple workloads running on fixed hardware, the AlgoPerf: Training Algorithms benchmark. Our benchmark includes a set of workload variants that make it possible to detect benchmark submissions that are more robust to workload changes than current widely used methods. Finally, we evaluate baseline submissions constructed using various optimizers that represent current practice, as well as other optimizers that have recently received attention in the literature. These baseline results collectively demonstrate the feasibility of our benchmark, show that non-trivial gaps between methods exist, and set a provisional state of the art for future benchmark submissions to try to surpass.
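
To make the time-to-result idea in challenge (1) concrete, here is a minimal sketch of how such a measurement could be computed from a log of periodic evaluations. It is an illustration under stated assumptions, not the AlgoPerf implementation: the function name `time_to_target`, its parameters, and the shape of the evaluation log (a time-ordered sequence of `(elapsed_seconds, metric)` pairs) are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def time_to_target(
    eval_history: Sequence[Tuple[float, float]],
    target: float,
    higher_is_better: bool = True,
    max_runtime: Optional[float] = None,
) -> Optional[float]:
    """Return the elapsed time of the first evaluation that reaches `target`.

    `eval_history` is assumed (hypothetically) to hold (elapsed_seconds, metric)
    pairs sorted by elapsed time. Returns None if the target is never reached
    within `max_runtime` (or at all, when no runtime budget is given).
    """
    for elapsed, metric in eval_history:
        # Evaluations past the runtime budget do not count.
        if max_runtime is not None and elapsed > max_runtime:
            break
        reached = metric >= target if higher_is_better else metric <= target
        if reached:
            return elapsed
    return None


# Hypothetical evaluation log: validation accuracy measured every 60 seconds.
history = [(60.0, 0.52), (120.0, 0.68), (180.0, 0.75), (240.0, 0.74)]
result = time_to_target(history, target=0.70)
```

In this sketch the score is the time of the first evaluation that meets the target, so a later dip in the metric (as at 240 seconds above) does not change the result, and a run that never reaches the target within the budget gets no time at all rather than a partial score.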
