An Analysis of Collocation on GPUs for Deep Learning Training

Deep learning training is an expensive process that relies heavily on GPUs, but not every training job saturates a modern, powerful GPU. Multi-Instance GPU (MIG) is a technology introduced by NVIDIA that partitions a GPU to better fit workloads that do not require all the memory and compute resources of a full GPU. In this paper, we examine the performance of a MIG-enabled A100 GPU under deep learning workloads containing various sizes and combinations of models. We contrast the benefits of MIG with older workload collocation methods on GPUs: naively submitting multiple processes on the same GPU and utilizing Multi-Process Service (MPS). Our results demonstrate that collocating multiple model training runs may yield significant benefits. In certain cases, it yields up to four times the training throughput, despite increased epoch time for each individual run. On the other hand, the aggregate memory footprint and compute needs of the models trained in parallel must fit the available memory and compute resources of the GPU. MIG can be beneficial thanks to its interference-free partitioning, especially when the sizes of the models align with the MIG partitioning options. MIG's rigid partitioning, however, may lead to sub-optimal GPU utilization for more dynamic, mixed workloads. In general, we recommend MPS as the best-performing and most flexible form of collocation for a single user submitting model training jobs.
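The trade-off between per-run epoch time and aggregate throughput can be illustrated with a small back-of-the-envelope calculation. The sketch below is not taken from the paper: the helper names `aggregate_speedup` and `fits_on_gpu`, and all numbers in the usage lines, are hypothetical and chosen only to show how several slower collocated runs can still finish more work per unit time than running the same jobs one after another, provided their combined memory footprint fits on the GPU.

```python
def aggregate_speedup(n_jobs, solo_epoch_s, collocated_epoch_s):
    """Throughput of n_jobs collocated runs relative to running them back to back.

    solo_epoch_s:       epoch time of one run that has the GPU to itself
    collocated_epoch_s: epoch time of each run when all n_jobs share the GPU
    """
    if n_jobs < 1 or solo_epoch_s <= 0 or collocated_epoch_s <= 0:
        raise ValueError("n_jobs must be >= 1 and epoch times must be positive")
    # Sequential: n_jobs epochs take n_jobs * solo_epoch_s seconds.
    # Collocated: the same n_jobs epochs finish in collocated_epoch_s seconds.
    return n_jobs * solo_epoch_s / collocated_epoch_s


def fits_on_gpu(job_mem_gb, gpu_mem_gb):
    """Check that the aggregate memory footprint of the jobs fits on the GPU."""
    return sum(job_mem_gb) <= gpu_mem_gb


# Hypothetical numbers, for illustration only (not measurements from the paper):
# seven small runs, each 1.75x slower when collocated, on a 40 GB GPU.
print(aggregate_speedup(n_jobs=7, solo_epoch_s=100, collocated_epoch_s=175))  # 4.0
print(fits_on_gpu([5, 5, 5, 5, 5, 5, 5], gpu_mem_gb=40))  # True
```

The memory check covers only one of the two constraints named above; compute contention is reflected indirectly through the collocated epoch time, which has to be measured rather than predicted.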

