First-come, first-served scheduling can leave a substantial fraction (up to 10%) of supercomputer nodes transiently idle. Recognizing that such unfilled nodes are well suited to deep neural network (DNN) training, because DNN training tasks can be flexibly rescaled, Liu et al. proposed that rescaling DNN training tasks to fit gaps in schedules be formulated as a mixed-integer linear programming (MILP) problem, and demonstrated the potential benefits of the approach via simulation. Here, we introduce MalleTrain, a system that provides the first practical implementation of this approach and that furthermore generalizes it by allowing its use even for DNN training applications for which model information is unknown before runtime. Key to this latter innovation is a lightweight online job profiling advisor (JPA) that collects critical scalability information for DNN jobs, which MalleTrain then uses to optimize resource allocations dynamically, in real time. We describe the MalleTrain architecture and present the results of a detailed experimental evaluation on a supercomputer GPU cluster with several representative DNN training workloads, including neural architecture search and hyperparameter optimization. Our results not only confirm the practical feasibility of leveraging idle supercomputer nodes for DNN training but also improve significantly on prior results, increasing training throughput by up to 22.3\% without requiring users to provide job scalability information.