Maximizing training throughput for resource efficiency while maintaining fairness among users is critical for deep learning (DL) training in heterogeneous GPU clusters. However, current DL schedulers provide only limited fairness properties and deliver suboptimal training throughput, preventing tenants from effectively leveraging heterogeneous resources. The underlying design challenge stems from the inherent conflict between efficiency and fairness. In this paper, we introduce OEF, a new resource allocation framework designed to achieve optimal resource efficiency and ensure diverse fairness properties in heterogeneous GPU clusters. By integrating resource efficiency and fairness within a global optimization framework, OEF provides users with maximized overall efficiency as well as various fairness guarantees, in both cooperative and non-cooperative environments. We have implemented OEF in a cluster resource manager and conducted large-scale experiments, showing that OEF improves overall training throughput by up to 32% while also improving fairness compared with state-of-the-art heterogeneity-aware schedulers.