Modern GPU systems are constantly evolving to meet the needs of compute-intensive applications in scientific and machine learning domains. However, a gap typically remains between hardware capacity and achievable application performance. This work aims to provide a better understanding of the Infinity Fabric interconnects on AMD GPUs and CPUs. We propose a test and evaluation methodology for characterizing the performance of data movement on multi-GPU systems, stressing different communication options on AMD MI250X GPUs, including point-to-point and collective communication, as well as memory allocation strategies between GPUs and with the host CPU. In a single-node setup with four GPUs, we show that direct peer-to-peer memory accesses between GPUs and use of the RCCL library outperform MPI-based solutions in terms of memory/communication latency and bandwidth. Our test and evaluation methodology serves as a basis for validating memory and communication strategies on a given system and for improving applications on AMD multi-GPU computing systems.