Pre-training large language models (LLMs) increasingly requires distributed compute, yet bandwidth constraints make it difficult to scale beyond well-provisioned datacenters, especially when model parallelism forces frequent, large inter-device communication. We study whether SparseLoCo, a low-communication data-parallel method based on infrequent synchronization and sparse pseudo-gradient exchange, can be combined with low-bandwidth pipeline model parallelism via activation and activation-gradient compression. We introduce a heterogeneous distributed training framework in which some participants host full replicas on high-bandwidth interconnects, while resource-limited participants are grouped to jointly instantiate a replica using pipeline parallelism with subspace-projected inter-stage communication. To make the recently introduced subspace pipeline compression compatible with SparseLoCo, we study several adaptations. Across large-scale language modeling experiments (178M-1B parameters) on standard pre-training corpora, we find that activation compression composes with SparseLoCo at modest cost, while selective (heterogeneous) compression consistently improves the loss-communication trade-off relative to compressing all replicas, especially at aggressive compression ratios. These results suggest a practical path to incorporating low-bandwidth model parallelism and heterogeneous participants into LLM pre-training.
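To make the two compression ideas in the abstract concrete, the sketch below illustrates (i) SparseLoCo-style sparse exchange, where only the top-k magnitude entries of a pseudo-gradient are communicated, and (ii) subspace-projected inter-stage communication, where activations are projected onto a low-rank basis before being sent to the next pipeline stage. This is a minimal illustrative sketch under generic assumptions; the function names, shapes, and the choice of a fixed orthonormal basis are hypothetical and do not reproduce the paper's actual implementation.

```python
# Illustrative sketch only: not the paper's implementation or API.
import numpy as np

def topk_sparsify(pseudo_grad: np.ndarray, k: int):
    """Keep the k largest-magnitude entries of a pseudo-gradient
    (sparse exchange in the SparseLoCo style); only (indices, values)
    would be communicated between replicas."""
    flat = pseudo_grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # top-k by magnitude
    return idx, flat[idx]

def subspace_compress(activations: np.ndarray, P: np.ndarray):
    """Project inter-stage activations (batch x d) onto a rank-r subspace
    spanned by the columns of P (d x r) before sending them downstream."""
    return activations @ P                          # batch x r, with r << d

def subspace_decompress(coeffs: np.ndarray, P: np.ndarray):
    """Approximately reconstruct activations on the receiving stage."""
    return coeffs @ P.T                             # batch x d

# Toy usage: a 1% sparse pseudo-gradient and an 8x activation compression.
rng = np.random.default_rng(0)
g = rng.standard_normal(10_000)
idx, vals = topk_sparsify(g, k=100)

x = rng.standard_normal((32, 1024))                 # a micro-batch of activations
P, _ = np.linalg.qr(rng.standard_normal((1024, 128)))  # orthonormal rank-128 basis
x_hat = subspace_decompress(subspace_compress(x, P), P)
```

In this toy setting the communication volume drops from d to r floats per token for activations (and from the full parameter count to k index-value pairs for pseudo-gradients), which is the loss-communication trade-off the experiments in the abstract measure.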