A primary cost driver for training large models is wall-clock training time. We show that popular FLOP-based estimates of training time are inaccurate, and construct a more accurate proxy based on memory copies. With some simple accounting, we can estimate the training speed of a transformer model from its hyperparameters. Combined with a scaling law curve such as Chinchilla, this lets us estimate the final loss of the model. We fit our estimate to real data with a linear regression, and apply the result to rewrite Chinchilla in terms of a model's estimated training time rather than the amount of training data. This yields an expression for the loss in terms of the model's hyperparameters alone. We show that this expression is accurate across a wide range of hyperparameter values, enabling us to make architectural decisions analytically and train models more efficiently.
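The pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's fitted model: the Chinchilla constants are the published fit from Hoffmann et al. (2022), but the memory-copy proxy `seconds_per_token` and its coefficients `c0`/`c1` are hypothetical placeholders standing in for the regression the paper performs.

```python
# Chinchilla fit (Hoffmann et al. 2022): L(N, D) = E + A/N^alpha + B/D^beta
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(n_params, n_tokens):
    """Predicted final loss for a model with n_params parameters
    trained on n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def seconds_per_token(d_model, n_layers, c0=1e-9, c1=1e-12):
    """Hypothetical linear model of step time driven by memory traffic.
    The paper fits such coefficients by linear regression on real runs;
    c0 and c1 here are illustrative placeholders."""
    mem_copies = n_layers * d_model**2  # crude proxy for bytes moved per token
    return c0 + c1 * mem_copies

def loss_for_budget(d_model, n_layers, wall_clock_seconds):
    """Loss in terms of hyperparameters and a wall-clock budget alone:
    the training-time proxy converts the budget into a token count,
    which is substituted for D in the Chinchilla curve."""
    n_params = 12 * n_layers * d_model**2  # standard transformer param count
    n_tokens = wall_clock_seconds / seconds_per_token(d_model, n_layers)
    return chinchilla_loss(n_params, n_tokens)
```

With the data term `D` rewritten as throughput times budget, comparing two architectures under the same wall-clock budget reduces to evaluating `loss_for_budget` at each candidate's hyperparameters.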