We present core attention disaggregation (CAD), a technique that improves long-context large language model training by decoupling the core attention computation, softmax(QK^T)V, from the rest of the model and executing it on a separate pool of devices. In existing systems, core attention is colocated with other layers; at long context lengths, its compute grows quadratically while that of the other components grows near-linearly, causing load imbalance and stragglers across data- and pipeline-parallel groups. CAD is enabled by two observations. First, core attention is stateless: it has no trainable parameters and only minimal transient data, so balancing it reduces to scheduling compute-bound tasks. Second, it is composable: modern attention kernels retain high efficiency when processing fused batches of token-level shards of arbitrary lengths. CAD partitions core attention into token-level tasks and dispatches them to dedicated attention servers, which dynamically rebatch the tasks to equalize compute without sacrificing kernel efficiency. We implement CAD in a system called DistCA, which uses a ping-pong execution scheme to fully overlap communication with computation and in-place execution on the attention servers to reduce memory use. On 512 H200 GPUs and context lengths up to 512k tokens, DistCA improves end-to-end training throughput by up to 1.35x, eliminates data- and pipeline-parallel stragglers, and achieves near-perfect compute and memory balance.
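To make the two enabling observations concrete, the following is a minimal illustrative sketch, not the DistCA implementation: it shows (1) the stateless core attention computation softmax(QK^T)V on a single token shard, and (2) a greedy scheduler that balances token-level attention tasks across a hypothetical pool of attention servers by their roughly quadratic compute cost. All function names, the longest-processing-time heuristic, and the toy shard lengths are assumptions for illustration only.

```python
# Illustrative sketch only; assumes numpy is available. Not DistCA's actual code.
import numpy as np


def core_attention(q, k, v):
    """Stateless core attention: softmax(Q K^T / sqrt(d)) V.

    q, k, v: arrays of shape (seq_len, head_dim). No trainable parameters
    are involved, which is why this computation can be moved freely across
    devices once Q, K, and V are available.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v                               # (seq_len, head_dim)


def balance_attention_tasks(shard_lengths, num_servers):
    """Greedy assignment of token-level attention shards to servers.

    A shard of length L costs roughly O(L^2) FLOPs, so this sketch places
    the most expensive shards first onto the currently least-loaded server
    (longest-processing-time rule). DistCA's real scheduler may differ.
    """
    loads = [0.0] * num_servers
    assignment = {s: [] for s in range(num_servers)}
    for shard_id, length in sorted(
        enumerate(shard_lengths), key=lambda x: -(x[1] ** 2)
    ):
        target = min(range(num_servers), key=lambda s: loads[s])
        loads[target] += float(length) ** 2
        assignment[target].append(shard_id)
    return assignment, loads


if __name__ == "__main__":
    # Toy example: eight shards with very uneven context lengths.
    lengths = [512, 4096, 1024, 8192, 2048, 256, 4096, 1024]
    assignment, loads = balance_attention_tasks(lengths, num_servers=4)
    print("per-server quadratic load:", loads)

    # Run core attention on one toy shard.
    rng = np.random.default_rng(0)
    seq_len, head_dim = lengths[0], 64
    q = rng.standard_normal((seq_len, head_dim))
    k = rng.standard_normal((seq_len, head_dim))
    v = rng.standard_normal((seq_len, head_dim))
    print("output shape:", core_attention(q, k, v).shape)
```

The sketch captures only the scheduling intuition: because each shard's cost is a known function of its length and no per-layer state travels with it, equalizing the summed quadratic loads across servers is sufficient to remove stragglers, independent of which data- or pipeline-parallel rank the shard originated from.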