Characterizing and predicting the training performance of modern machine learning (ML) workloads on systems where computation and communication are spread across CPUs, GPUs, and network devices is key to optimization and planning, but it is also difficult to achieve. The primary challenges include the complexity of synchronization and load balancing between CPUs and GPUs, the variance in input data distribution, and the variety of communication devices and topologies (e.g., NVLink, PCIe, network cards) that connect multiple compute devices, coupled with the desire for flexible training configurations. Building on our prior work for single-GPU platforms, we address these challenges and enable multi-GPU performance modeling by incorporating (1) data-distribution-aware performance models for embedding table lookup and (2) data movement prediction of communication collectives into our upgraded performance modeling pipeline, which is equipped with inter- and intra-rank synchronization for ML workloads trained on multi-GPU platforms. Beyond accurately predicting the per-iteration training time of DLRMs with random configurations, with a geomean error of 5.21% on two multi-GPU platforms, our prediction pipeline generalizes well to other types of ML workloads, such as Transformer-based NLP models, with a geomean error of 3.00%. Moreover, even without actually running ML workloads such as DLRMs on the hardware, it can generate insights, for example by quickly selecting the fastest embedding table sharding configuration (with a success rate of 85%).
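
The geomean errors quoted above can be illustrated with a minimal sketch. It assumes the metric is the geometric mean of per-configuration absolute relative errors between predicted and measured per-iteration time; the abstract does not define the metric, so this definition, the function name `geomean_error`, and the sample timings are all hypothetical.

```python
import math

def geomean_error(predicted, measured):
    """Geometric mean of absolute relative errors |pred - meas| / meas.

    Assumed definition for illustration; the abstract does not spell it out.
    """
    errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    # Geometric mean computed as exp(mean(log(e))); requires every error > 0.
    return math.exp(sum(math.log(e) for e in errors) / len(errors))

# Hypothetical per-iteration times in milliseconds (not data from the paper).
measured = [100.0, 200.0, 50.0]
predicted = [104.0, 190.0, 51.0]

print(f"geomean error: {geomean_error(predicted, measured):.2%}")
```

Compared with an arithmetic mean, a geometric mean is less dominated by a few badly mispredicted configurations, which may be why it is a common choice when errors span a wide range.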