The cloud remains a popular platform for distributed deep learning (DL) training jobs because resource sharing in the cloud can improve resource utilization and reduce overall costs. However, such sharing also brings multiple challenges for DL training jobs; for example, high-priority jobs can impact, or even interrupt, low-priority jobs. Meanwhile, most existing distributed DL training systems require users to configure a job's resources (i.e., the number of nodes and the resources, such as CPU and memory, allocated to each node) manually before job submission, and cannot adjust the job's resources at runtime. The resource configuration of a job strongly affects its performance (e.g., training throughput, resource utilization, and completion rate). Manual configuration therefore usually leads to poor job performance, since users fail to provide an optimal resource configuration in most cases. \system~is a distributed DL framework that can auto-configure a DL job's initial resources and dynamically tune the job's resources to achieve better performance. With its elastic capability, \system~can effectively adjust the resources of a job when performance issues are detected or when the job fails because of faults or eviction. Evaluation results show that \system~can outperform well-tuned manual resource configurations. Furthermore, in the production Kubernetes cluster of \company, \system~reduces the median job completion time by 31\% and improves the job completion rate by 6\%, CPU utilization by 15\%, and memory utilization by 20\% compared with manual configuration.