Deep learning training is an expensive process that extensively uses GPUs, but not all model training saturates modern powerful GPUs. Multi-Instance GPU (MIG) is a new technology introduced by NVIDIA that can partition a GPU to better fit workloads that do not require all the memory and compute resources of a full GPU. In this paper, we examine the performance of a MIG-enabled A100 GPU under deep learning workloads containing various sizes and combinations of models. We contrast the benefits of MIG with older workload collocation methods on GPUs: naively submitting multiple processes on the same GPU and utilizing Multi-Process Service (MPS). Our results demonstrate that collocating multiple model training runs may yield significant benefits. In certain cases, it can yield up to four times the training throughput despite increased epoch time. On the other hand, the aggregate memory footprint and compute needs of the models trained in parallel must fit the available memory and compute resources of the GPU. MIG can be beneficial thanks to its interference-free partitioning, especially when the sizes of the models align with the MIG partitioning options. MIG's rigid partitioning, however, may create sub-optimal GPU utilization for more dynamic mixed workloads. In general, we recommend MPS as the best-performing and most flexible form of collocation for model training for a single user submitting training jobs.
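The memory-fit constraint and the effect of rigid MIG partitioning can be illustrated with a minimal sketch. It assumes the 40 GB variant of the A100 (the abstract does not state which variant is used) and lists that card's MIG instance profiles. The helper names `fits_on_gpu` and `smallest_mig_profile`, and the model footprints in the usage note, are hypothetical and not taken from the paper. The sketch checks memory only and does not model compute contention.

```python
# Assumption: A100 40GB. MIG profiles: name -> (compute slices, memory in GB).
A100_40GB_MIG_PROFILES = {
    "1g.5gb": (1, 5),
    "2g.10gb": (2, 10),
    "3g.20gb": (3, 20),
    "4g.20gb": (4, 20),
    "7g.40gb": (7, 40),
}


def fits_on_gpu(footprints_gb, gpu_memory_gb=40.0):
    """Return True if the aggregate memory footprint of all collocated
    training runs fits in the GPU memory (relevant for naive sharing and MPS)."""
    return sum(footprints_gb) <= gpu_memory_gb


def smallest_mig_profile(footprint_gb):
    """Return the name of the smallest MIG profile whose memory can hold
    the model, or None if even the full-GPU profile is too small."""
    candidates = [
        (slices, mem_gb, name)
        for name, (slices, mem_gb) in A100_40GB_MIG_PROFILES.items()
        if mem_gb >= footprint_gb
    ]
    # Tuples sort by compute slices first, then memory, then name.
    return min(candidates)[2] if candidates else None
```

For example, with hypothetical footprints of 4 GB, 4 GB, and 12 GB, `fits_on_gpu([4, 4, 12])` holds, so the three runs could share the GPU under MPS. Under MIG, the 12 GB model needs at least a `3g.20gb` instance, leaving 8 GB of that instance unused, which is the kind of sub-optimal utilization that rigid partitioning can cause.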