Concurrent computation and communication (C3) is a pervasive paradigm in ML and other domains, making its performance optimization crucial. In this paper, we carefully characterize C3 in ML on GPUs, which are the most widely deployed accelerators for ML training and inference. We observe that while C3 delivers performance uplifts, these uplifts fall far short of the ideal speedup (serial computation plus communication versus the maximum of computation or communication, with all times measured from isolated executions). C3 on average achieves only 21% of the ideal speedup. This shortfall stems from known challenges of compute and memory interference between concurrent GPU kernels (i.e., sharing of the GPU's compute units, caches, and HBM). To attain better C3 performance, we first evaluate the dual strategies of schedule prioritization and careful partitioning of GPU compute units, which push C3 performance to, on average, 42% of the ideal speedup. We also provide heuristics that can guide a runtime in employing these strategies. To further enhance C3 performance, we propose mitigating C3 interference by offloading communication tasks to the GPU's DMA engines. To this end, we build Concurrent Communication CoLlectives (ConCCL) proof-of-concepts that harness DMA engines for communication. We show that ConCCL considerably closes the gap between realized and ideal speedup for C3, realizing on average 72% of the ideal speedup with up to 1.67x speedup. Overall, our work makes a strong case for advancing GPU DMA engines to better support C3 on GPUs.
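The ideal-speedup metric defined above can be sketched as follows. This is a minimal illustration with hypothetical timings; the function names and the realized-fraction formulation are our assumptions, not taken from the paper.

```python
def ideal_speedup(t_compute: float, t_comm: float) -> float:
    """Best-case speedup from perfectly overlapping compute and communication:
    serial time (t_compute + t_comm) divided by the overlap lower bound
    max(t_compute, t_comm). Both times come from isolated executions."""
    return (t_compute + t_comm) / max(t_compute, t_comm)


def realized_speedup(t_compute: float, t_comm: float, t_overlapped: float) -> float:
    """Speedup actually achieved by a measured concurrent (C3) run,
    relative to running compute and communication serially."""
    return (t_compute + t_comm) / t_overlapped


# Hypothetical example: 2 ms of compute and 1 ms of communication in isolation.
# Perfect overlap would hide the communication entirely behind compute:
#   ideal_speedup(2.0, 1.0) == (2.0 + 1.0) / max(2.0, 1.0) == 1.5
# If the concurrent run takes 2.6 ms due to interference, the realized
# speedup is 3.0 / 2.6 ≈ 1.15, well below the 1.5x ideal.
```

Interference from shared compute units, caches, and HBM is what keeps `t_overlapped` above `max(t_compute, t_comm)` in practice, which is why the measured runs realize only a fraction of the ideal.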