As large language models continue to scale, the compute and system capacity required for training grow rapidly, making single-vendor homogeneous clusters insufficient. This paper presents a technical solution for heterogeneous mixed training in AMD-NVIDIA environments. We first adopt a compatibility-oriented approach based on CPU-Forwarding Communication, which selects differentiated communication back ends across parallel groups and uses multi-NIC parallel data transfer. To achieve higher performance, we further propose a Device-Direct Communication approach that integrates a CPU-offloading P2P mechanism to enable direct cross-vendor GPU data transfer without host-memory staging. Experiments on LLaMA-8B and Qwen2-7B demonstrate that the proposed Device-Direct Communication approach achieves up to 98% of the throughput of an NVIDIA homogeneous system while preserving training stability and correctness.
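To make the two communication paths concrete, the snippet below is a minimal, hypothetical sketch of the compatibility-oriented CPU-Forwarding path: collectives that cross the AMD/NVIDIA boundary are staged through host memory and executed on a CPU backend (Gloo), while intra-vendor groups would keep their native GPU backend (NCCL/RCCL). The helper name, group construction, and tensor shapes are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of CPU-Forwarding Communication: a cross-vendor all-reduce
# is staged through host memory and run on a CPU (Gloo) backend, avoiding any
# GPU-to-GPU transfer between AMD and NVIDIA devices.
import torch
import torch.distributed as dist


def cpu_forwarded_all_reduce(tensor: torch.Tensor, cross_vendor_group) -> torch.Tensor:
    """All-reduce a GPU tensor across a mixed AMD/NVIDIA group via host-memory staging."""
    host_buf = tensor.detach().cpu()                      # device -> host copy
    dist.all_reduce(host_buf, group=cross_vendor_group)   # collective on the CPU backend
    return tensor.copy_(host_buf)                         # host -> device copy back


if __name__ == "__main__":
    # Illustrative bootstrap; in a real mixed cluster, intra-vendor groups would
    # use NCCL/RCCL and only cross-vendor groups would fall back to Gloo.
    dist.init_process_group(backend="gloo")
    cross_vendor_group = dist.new_group(
        ranks=list(range(dist.get_world_size())), backend="gloo"
    )
    device = "cuda" if torch.cuda.is_available() else "cpu"
    grad = torch.ones(4, device=device)
    cpu_forwarded_all_reduce(grad, cross_vendor_group)
```

The Device-Direct Communication approach described above would replace the host staging step with a direct cross-vendor GPU transfer; its mechanism is detailed in the body of the paper rather than sketched here.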