Training large language models requires distributing computation across many accelerators, yet practitioners select parallelism strategies (data, tensor, pipeline, ZeRO) by trial and error because no unified, systematic framework predicts their behavior. We introduce placement semantics: each strategy is specified by how it places four training states (parameters, optimizer state, gradients, activations) across devices using five modes (replicated, sharded, sharded-with-gather, materialized, offloaded). From placement alone, without implementation details, we derive memory consumption and communication volume. Our predictions match published results exactly: for ZeRO-3, we recover the original paper's reported 8x memory reduction over data parallelism at 1.5x communication cost. We prove that two conditions (gradient integrity, state consistency) are necessary and sufficient for distributed training to match single-device results, and we provide composition rules for combining strategies safely. The framework unifies ZeRO Stages 1-3, Fully Sharded Data Parallel (FSDP), tensor parallelism, and pipeline parallelism as instances with different placement choices.
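As a rough illustration of how memory can be read off from placement alone, the sketch below estimates per-device bytes for the persistent training states under two placements: fully replicated (plain data parallelism) and fully sharded (ZeRO-3-style). All names here (`PlacementMode`, `per_device_bytes`) are hypothetical and not from the paper; the byte counts assume mixed-precision Adam (2-byte fp16 parameters and gradients, 12 bytes of fp32 optimizer state per parameter), and activations plus the remaining three modes are omitted for brevity. It is a sketch of the bookkeeping, not the paper's framework, and it does not attempt to reproduce the exact figures quoted above.

```python
# Minimal sketch (hypothetical API): derive per-device memory from a
# placement specification alone, with no reference to implementation details.
from enum import Enum


class PlacementMode(Enum):
    REPLICATED = "replicated"  # every device holds a full copy
    SHARDED = "sharded"        # each device holds 1/world_size of the state


def per_device_bytes(placement: dict, num_params: int, world_size: int) -> int:
    """Estimate per-device memory (bytes) for the persistent training states.

    Assumes mixed-precision Adam accounting: 2 B/param for fp16 parameters,
    2 B/param for fp16 gradients, 12 B/param for fp32 optimizer state
    (master copy, momentum, variance). Activations are not modeled.
    """
    bytes_per_param = {"params": 2, "grads": 2, "optimizer": 12}
    total = 0
    for state, mode in placement.items():
        full = num_params * bytes_per_param[state]
        total += full if mode is PlacementMode.REPLICATED else full // world_size
    return total


if __name__ == "__main__":
    num_params, world_size = 7_500_000_000, 64  # e.g. a 7.5B-param model on 64 devices

    data_parallel = {s: PlacementMode.REPLICATED for s in ("params", "grads", "optimizer")}
    fully_sharded = {s: PlacementMode.SHARDED for s in ("params", "grads", "optimizer")}

    m_dp = per_device_bytes(data_parallel, num_params, world_size)
    m_fs = per_device_bytes(fully_sharded, num_params, world_size)

    print(f"data parallel: {m_dp / 2**30:6.1f} GiB per device")
    print(f"fully sharded: {m_fs / 2**30:6.1f} GiB per device")
    # For these states the reduction scales with world_size; the paper's
    # figures additionally account for activations and communication volume.
    print(f"reduction:     {m_dp / m_fs:.0f}x")
```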