GPUs running deep learning (DL) workloads are frequently underutilized. Collocating multiple DL training tasks on the same GPU can improve utilization but introduces two key risks: (1) out-of-memory (OOM) crashes for newly scheduled tasks, and (2) severe performance interference among co-running tasks, which can negate any throughput gains. These issues reduce system robustness, quality of service, and energy efficiency. We present CARMA, a task-level, collocation-aware resource manager at the server scale. CARMA addresses collocation challenges via (1) fine-grained monitoring and bookkeeping of GPUs, together with a collocation risk analysis that filters out high-risk GPUs; (2) task placement policies that cap GPU utilization to limit OOMs and interference; (3) integration of GPU memory need estimators for DL tasks to minimize OOMs during collocation; and (4) a lightweight recovery method that relaunches jobs that crashed due to OOMs. Our evaluation on a DL training workload derived from real-world traces shows that CARMA uses GPUs more efficiently by making more informed collocation decisions: for the best-performing collocation policy, CARMA increases GPU streaming multiprocessor (SM) utilization by 54%, the parallelism achieved per SM by 61%, and memory use by 62%. This results in a ~35% and ~15% reduction in the end-to-end execution time (makespan) and GPU energy consumption, respectively, for this workload.
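The core placement decision described above (filter high-risk GPUs, then place under a utilization cap using a memory estimate) can be sketched as follows. This is a simplified illustration, not CARMA's actual implementation: the `GpuStats` fields, the `util_cap` and `mem_margin` thresholds, and the best-fit tie-breaking rule are all hypothetical assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GpuStats:
    gpu_id: int
    sm_util: float    # current SM utilization, 0.0-1.0 (from monitoring)
    free_mem_mb: int  # currently free device memory in MB

def place_task(
    gpus: list[GpuStats],
    est_mem_mb: int,          # estimated memory need of the new task
    util_cap: float = 0.8,    # cap on SM utilization (assumed threshold)
    mem_margin: float = 1.1,  # safety margin over the memory estimate
) -> Optional[int]:
    """Return the id of the chosen GPU, or None if every GPU is high-risk."""
    # Risk analysis: filter out GPUs that are already near the utilization
    # cap or lack memory headroom for the task's estimated need.
    safe = [
        g for g in gpus
        if g.sm_util < util_cap and g.free_mem_mb >= est_mem_mb * mem_margin
    ]
    if not safe:
        return None  # defer the task rather than risk an OOM crash
    # Among safe GPUs, prefer the most loaded one that still fits
    # (best-fit packing keeps other GPUs free for larger tasks).
    return max(safe, key=lambda g: g.sm_util).gpu_id

gpus = [GpuStats(0, 0.9, 4000), GpuStats(1, 0.3, 8000), GpuStats(2, 0.5, 2000)]
print(place_task(gpus, est_mem_mb=3000))  # GPU 0 exceeds cap, GPU 2 lacks memory -> 1
```

A real implementation would source `sm_util` and `free_mem_mb` from a monitoring layer (e.g. NVML queries) and would also handle the recovery path, relaunching any task that crashes with an OOM despite the estimate.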