Offloading communication to the direct memory access (DMA) engines already available on most state-of-the-art commercial GPUs has emerged as a low-cost way to efficiently overlap computation and communication in machine learning (ML). So far, however, DMA offloads have been limited to bandwidth-bound scenarios (transfer sizes from tens of MB to GB). In this work, we aim to break this barrier and extend DMA communication offloads to latency-bound regions (KB to low MB). Specifically, we discuss hitherto untapped features of the state-of-the-art AMD Instinct$^{\mathrm{TM}}$ MI300X GPUs that render DMA communication offloads competitive even in latency-bound regions. We demonstrate the efficacy of these features at the operator level (ML communication collectives such as all-gather and all-to-all) and at the end-to-end workload level (LLM inference). For the former, our optimized DMA offloads close a performance gap of up to 4.5$\times$ and deliver additional power savings (3-10%) for ML collectives compared to RCCL, the state-of-the-art GPU core-based communication library. For the latter, we demonstrate acceleration of LLM inference: up to 1.5$\times$ lower latency and up to 1.9$\times$ higher throughput than the state-of-the-art vLLM inference framework. We conclude with a discussion of AMD Instinct GPU runtime innovations that stand to expose these features, and we identify future hardware-software co-design opportunities to further improve DMA offload efficiency.