Efficiently processing long sequences with Transformer models usually requires splitting the computation across accelerators via context parallelism. The dominant approaches in this family, such as Ring Attention or DeepSpeed Ulysses, enable scaling over the context dimension but do not focus on memory efficiency, which limits the sequence lengths they can support. More advanced techniques, such as Fully Pipelined Distributed Transformer or activation offloading, can further extend the achievable context length, but at the cost of training throughput. In this paper, we present UPipe, a simple yet effective context parallelism technique that performs fine-grained chunking at the attention-head level. This technique significantly reduces the activation memory usage of self-attention, breaking the activation memory barrier and unlocking much longer context lengths. Our approach reduces intermediate tensor memory usage in the attention layer by as much as 87.5$\%$ for 32B Transformers, while matching prior context parallelism techniques in training speed. UPipe can support a context length of 5M tokens when training Llama3-8B on a single 8$\times$H100 node, improving upon prior methods by over 25$\%$.