Accurate prediction of training time in distributed deep learning is crucial for resource allocation, cost estimation, and job scheduling. We observe that the floating-point precision setting is a key determinant of training time, causing it to vary by about 2.4x relative to its minimum. However, existing studies on distributed training time prediction rely on static model computation graphs, which do not capture precision variations, including mixed precision. In our experiments, predicting training time without considering precision leads to large errors, with a mean absolute percentage error (MAPE) of up to 147.85%. To address this issue, we propose a precision-aware distributed training time predictor that achieves robust accuracy across diverse precision settings, including mixed precision, with a MAPE of 9.8%.
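For readers unfamiliar with the metric, the sketch below computes MAPE using its standard definition and shows why a predictor that ignores precision can score poorly. It is an illustration only, not the predictor proposed here: the function name `mape`, the precision labels, and all timing values are hypothetical and chosen so that the slowest setting is 2.4x the fastest.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent (standard definition)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical measured per-iteration times (seconds) for one model
# under three precision settings; these are NOT results from the paper.
measured = {"fp32": 2.4, "mixed": 1.5, "fp16": 1.0}

# A precision-blind predictor returns the same estimate for every
# setting; here we assume it was calibrated on the fp32 run.
blind = {name: 2.4 for name in measured}

# A hypothetical precision-aware predictor with a 10% error per setting.
aware = {name: 1.1 * t for name, t in measured.items()}

names = list(measured)
blind_mape = mape([measured[n] for n in names], [blind[n] for n in names])
aware_mape = mape([measured[n] for n in names], [aware[n] for n in names])

print(f"precision-blind MAPE: {blind_mape:.2f}%")
print(f"precision-aware MAPE: {aware_mape:.2f}%")
```

In this toy setup the precision-blind estimate is exact for fp32 but overshoots the faster settings, so its MAPE is dominated by the settings it was not calibrated for.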