Existing work on the distributed training of machine learning (ML) models has consistently overlooked the distribution of the achieved learning quality, focusing instead on its average value. This leads to poor dependability of the resulting ML models, whose performance may be much worse than expected. We fill this gap by proposing DepL, a framework for dependable learning orchestration, able to make high-quality, efficient decisions on (i) the data to leverage for learning, (ii) the models to use and when to switch among them, and (iii) the clusters of nodes, and the resources thereof, to exploit. For concreteness, we consider as possible available models a full DNN and its compressed versions. Unlike previous studies, DepL guarantees that a target learning quality is reached with a target probability, while keeping the training cost at a minimum. We prove that DepL has a constant competitive ratio and polynomial complexity, and show that it outperforms the state of the art by over 27% and closely matches the optimum.
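To make the contrast between an average-quality criterion and a probabilistic one concrete, the following minimal sketch checks both on a set of final training losses. It is an illustration only, not DepL's decision procedure: the function names, the use of loss (lower is better) as the quality metric, and the sample values are all assumptions introduced here.

```python
from statistics import mean


def meets_target_on_average(losses, target_loss):
    """Average-based criterion: the mean final loss is within the target."""
    return mean(losses) <= target_loss


def meets_target_with_probability(losses, target_loss, target_prob):
    """Probabilistic criterion: the empirical fraction of runs whose final
    loss is within the target is at least target_prob."""
    hits = sum(1 for loss in losses if loss <= target_loss)
    return hits / len(losses) >= target_prob


# Hypothetical final losses from ten independent training runs
# (illustrative numbers, not results from the paper).
losses = [0.10, 0.11, 0.09, 0.10, 0.12, 0.08, 0.10, 0.11, 0.45, 0.50]

average_ok = meets_target_on_average(losses, target_loss=0.2)
dependable = meets_target_with_probability(losses, target_loss=0.2, target_prob=0.9)
```

With these sample values the mean loss is within the target, yet only eight of the ten runs reach it, so a 90% probability requirement is not met. This is the kind of gap that looking at the average alone would hide.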