Distributed ML workloads rely heavily on collective communication across multi-GPU, multi-node systems. Emerging scale-up fabrics, such as NVLink and UALink, enable direct memory access across nodes but introduce a critical destination-side translation step: translating Network Physical Addresses (NPAs) to System Physical Addresses (SPAs), which we term Reverse Address Translation. Despite its importance, the performance impact of Reverse Address Translation remains poorly understood. In this work, we present the first systematic study of Reverse Address Translation in large-scale GPU clusters. Using an extended ASTRA-sim framework with OMNeT++ as the network backend, we model Link MMUs and Link TLBs and evaluate their effect on All-to-All collective communication across varying input sizes and GPU counts. Our analysis shows that cold TLB misses dominate latency for small, latency-sensitive collectives, causing up to 1.4x performance degradation, while larger collectives benefit from warmed caches and see diminishing returns from oversized TLBs. Based on these observations, we propose two avenues for optimization: fused pre-translation kernels that overlap Reverse Address Translation with computation, and software-guided TLB prefetching that proactively populates likely-needed entries. These techniques aim to hide translation latency, particularly for small collectives, improving throughput and scalability for inference workloads. Our study establishes a foundation for designing efficient destination-side translation mechanisms in large-scale multi-GPU systems.
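To make the destination-side translation step concrete, the following is a minimal toy sketch of a Link TLB with LRU replacement and an optional software prefetch. It is not the ASTRA-sim/OMNeT++ model used in the study: the class `LinkTLB`, the helper `run_transfer`, the 2 MiB page size, the 4 KiB access granularity, the 64-entry capacity, and the NPA-to-SPA mapping are all hypothetical choices made only for illustration.

```python
from collections import OrderedDict

PAGE_SIZE = 2 * 1024 * 1024  # assumed 2 MiB translation granule (illustrative)


class LinkTLB:
    """Toy LRU cache of NPA page -> SPA page mappings (hypothetical model)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table  # stands in for a Link MMU page walk
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def _fill(self, npa_page):
        # Insert a mapping, evicting the least recently used entry if full.
        if npa_page in self.entries:
            self.entries.move_to_end(npa_page)
            return
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)
        self.entries[npa_page] = self.page_table[npa_page]

    def prefetch(self, npa_pages):
        # Software-guided prefetch: populate entries before traffic arrives.
        # Prefetches are not counted as demand hits or misses.
        for page in npa_pages:
            self._fill(page)

    def translate(self, npa):
        # Reverse translation: NPA -> SPA, counting demand hits and misses.
        page, offset = divmod(npa, PAGE_SIZE)
        if page in self.entries:
            self.hits += 1
            self.entries.move_to_end(page)
        else:
            self.misses += 1
            self._fill(page)
        return self.entries[page] * PAGE_SIZE + offset


def run_transfer(size_bytes, chunk_bytes=4096, prefetch=False, capacity=64):
    """Stream one incoming buffer through a cold TLB; return (hits, misses)."""
    num_pages = -(-size_bytes // PAGE_SIZE)  # ceiling division
    page_table = {p: p + 1000 for p in range(num_pages)}  # arbitrary SPA pages
    tlb = LinkTLB(capacity, page_table)
    if prefetch:
        # Only effective here while num_pages <= capacity.
        tlb.prefetch(range(num_pages))
    for npa in range(0, size_bytes, chunk_bytes):
        tlb.translate(npa)
    return tlb.hits, tlb.misses


if __name__ == "__main__":
    print(run_transfer(64 * 1024))                 # (15, 1)
    print(run_transfer(8 * 1024 * 1024))           # (2044, 4)
    print(run_transfer(64 * 1024, prefetch=True))  # (16, 0)
```

Under these assumed parameters, the single cold miss accounts for 1 of 16 accesses in the 64 KiB transfer but only 4 of 2048 in the 8 MiB transfer, and prefetching removes the demand miss entirely. The sketch only illustrates why cold misses weigh more heavily on small transfers; it carries no latency model and does not reproduce the reported results.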