Pipeline parallelism (PP) is widely used for training large language models (LLMs), yet its scalability is often constrained by high activation memory consumption, since the number of in-flight microbatches grows with the degree of PP. In this paper, we address this challenge by leveraging the under-explored strategy of memory offload in PP. Through empirical study, we find that in the majority of standard configurations, at least half, and potentially all, of the activations can be offloaded with negligible overhead. Where full offload is not possible, we introduce a novel selective offload strategy that reduces peak activation memory in a better-than-linear manner. Furthermore, we integrate memory offload with other techniques to jointly optimize overall throughput and memory usage. Our experiments show that per-device activation memory effectively decreases with the total number of stages, making PP a stronger alternative to tensor parallelism (TP) and offering up to a 19\% acceleration with even lower memory consumption. The implementation is open-sourced at \href{https://github.com/sail-sg/zero-bubble-pipeline-parallelism}{this url}.
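As a back-of-envelope illustration of the memory pressure described above, the sketch below models per-stage peak activation memory under a standard 1F1B pipeline schedule, where stage $i$ (0-indexed) holds activations for up to $p - i$ in-flight microbatches. The uniform `offload_fraction` here is a deliberate simplification for illustration only; the selective strategy proposed in the paper does better than this linear reduction.

```python
# Illustrative memory model (not the paper's implementation): under a
# 1F1B schedule with p stages, stage i must retain activations for up
# to p - i in-flight microbatches, so the first stage's activation
# memory grows linearly with pipeline depth.

def peak_inflight_microbatches(num_stages: int, stage: int) -> int:
    """Peak number of microbatches whose activations a given stage
    holds under 1F1B (0-indexed stages)."""
    return num_stages - stage

def peak_activation_memory(num_stages: int, stage: int,
                           per_microbatch_mem: float,
                           offload_fraction: float = 0.0) -> float:
    """Peak on-device activation memory for one stage, assuming a
    fraction of each microbatch's activations is offloaded to host
    memory (hypothetical uniform-offload model, not the paper's
    selective strategy)."""
    resident = per_microbatch_mem * (1.0 - offload_fraction)
    return peak_inflight_microbatches(num_stages, stage) * resident

# Stage 0 of an 8-stage pipeline holds 8 microbatches' activations;
# uniformly offloading half of them halves the on-device peak.
full = peak_activation_memory(8, 0, per_microbatch_mem=1.0)
half = peak_activation_memory(8, 0, per_microbatch_mem=1.0,
                              offload_fraction=0.5)
print(full, half)  # 8.0 4.0
```

This toy model only captures why the on-device peak scales with pipeline depth; the better-than-linear reduction claimed in the abstract comes from choosing *which* activations to offload per stage, which this uniform sketch does not model.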