T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives

Large Language Models increasingly rely on distributed techniques for training and inference. These techniques require communication across devices, which can reduce scaling efficiency as the number of devices increases. While some distributed techniques can overlap, and thus hide, this communication with independent computations, techniques such as Tensor Parallelism (TP) inherently serialize communication with model execution. One approach to hiding this serialized communication is to interleave it, in a fine-grained manner, with the producer operation (the operation that generates the communicated data). However, implementing this fine-grained interleaving of communication and computation in software can be difficult. Furthermore, as with any concurrent execution, it requires compute and memory resources to be shared between computation and communication, causing resource contention that reduces overlapping efficacy. To overcome these challenges, we propose T3, which applies hardware-software co-design to transparently overlap serialized communication while minimizing resource contention with compute. T3 transparently fuses producer operations with the subsequent communication via a simple configuration of the producer's output address space and requires only minor software changes. At the hardware level, T3 adds a lightweight track-and-trigger mechanism to orchestrate the producer's compute and communication. It further uses compute-enhanced memories for the computation that accompanies communication. As a result, T3 reduces resource contention and efficiently overlaps serialized communication with computation. For important Transformer models like T-NLG, T3 speeds up communication-heavy sublayers by 30% geomean (max 47%) and reduces data movement by 22% geomean (max 36%). Furthermore, T3's benefits persist as models scale: a 29% geomean speedup for sublayers in $\sim$500-billion parameter models, PaLM and MT-NLG.
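To make the benefit of fine-grained interleaving concrete, the following is a minimal timing sketch, not T3's hardware mechanism. It assumes an idealized two-stage pipeline in which the producer's output is split into chunks and each chunk's communication can start as soon as that chunk is computed. The function names and the time values are hypothetical, and the model deliberately ignores the resource contention that the abstract identifies as a limit on overlap in practice.

```python
def serialized_time(compute, comm):
    """Total time when no communication overlaps with compute."""
    return sum(compute) + sum(comm)


def overlapped_time(compute, comm):
    """Idealized two-stage pipeline over chunks.

    Chunk i's communication starts once chunk i has been computed and
    chunk i-1's communication has finished. Contention between compute
    and communication is not modeled (an assumption of this sketch).
    """
    compute_done = 0.0  # time at which the current chunk's compute ends
    comm_done = 0.0     # time at which the previous chunk's comm ends
    for c, m in zip(compute, comm):
        compute_done += c
        comm_done = max(compute_done, comm_done) + m
    return comm_done


# Hypothetical workload: 8 time units of producer compute and 4 units of
# communication, split into 4 equal chunks.
chunks = 4
compute = [8.0 / chunks] * chunks
comm = [4.0 / chunks] * chunks

print(serialized_time(compute, comm))  # 12.0
print(overlapped_time(compute, comm))  # 9.0
```

In this idealized model, only the last chunk's communication remains exposed, so finer chunking hides more of the communication. Real systems fall short of this bound when compute and communication compete for the same compute units and memory bandwidth.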
