Pipeline parallelism is widely used to scale the training of transformer-based large language models, and many works have sought to improve its throughput and memory footprint. In this paper, we address a frequently overlooked issue: the vocabulary layers can cause imbalanced computation and memory usage across pipeline stages, worsening pipeline bubbles and the memory bottleneck. To tackle this, we partition the vocabulary layers evenly across pipeline devices and group the computation into pipeline passes. To reduce the activation memory overhead, we propose several algorithms that reduce the number of communication barriers within the vocabulary layers. Additionally, we present a generalizable method for integrating Vocabulary Parallelism with existing pipeline schedules. Combining these techniques, our method effectively balances computation and parameter memory at the cost of only a small constant activation memory overhead. Notably, when combined with activation-memory-balanced schedules such as V-Half, our approach achieves perfect balance in both memory and computation. Extensive evaluations demonstrate that our method achieves computation and memory balance regardless of vocabulary size, yielding a 5% to 51% throughput improvement over naive approaches while significantly reducing peak memory usage, especially in large-vocabulary scenarios. Our implementation is open-sourced at https://github.com/sail-sg/VocabularyParallelism .
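The even vocabulary partition can be illustrated with a minimal single-process sketch (all names here are illustrative, not the paper's actual API or pipeline integration): each device holds a contiguous slice of the vocabulary's output rows and computes only its partial logits; the softmax normalizer is then recovered with two global reductions (a max and a sum), which in a distributed run would each be an all-reduce and are the kind of communication barriers the abstract refers to.

```python
import math

def shard_vocab(logit_rows, num_devices):
    """Partition the vocabulary rows evenly (contiguous slices) across devices."""
    per = math.ceil(len(logit_rows) / num_devices)
    return [logit_rows[i * per:(i + 1) * per] for i in range(num_devices)]

def logsumexp_over_shards(local_logits):
    """Numerically stable log-sum-exp combining per-device partial logits.
    In a real distributed setup, the max and the sum would each be a
    cross-device all-reduce; here both are done locally for illustration."""
    g_max = max(max(s) for s in local_logits if s)   # "all-reduce(max)"
    g_sum = sum(math.exp(x - g_max) for s in local_logits for x in s)  # "all-reduce(sum)"
    return g_max + math.log(g_sum)

# Toy example: logits for an 11-token vocabulary split across 4 "devices".
logits = [0.1 * i for i in range(11)]
shards = shard_vocab(logits, 4)
lse = logsumexp_over_shards(shards)

# Reference: the same normalizer computed without sharding.
m = max(logits)
ref = m + math.log(sum(math.exp(x - m) for x in logits))
```

With the normalizer `lse` in hand, each device can compute the cross-entropy contribution of the target tokens whose rows it owns, so no device ever materializes the full vocabulary-sized logit tensor.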