Large-scale model training increasingly relies on composing multiple parallelism strategies, such as data, pipeline, and expert parallelism, together with memory-saving optimizations like ZeRO. Deployed systems for foundation model pretraining often rely on human experts to manually design a high-level parallelism strategy and then implement the corresponding low-level execution strategy, which makes it difficult to adapt the system to new strategies. Meanwhile, many general-purpose frameworks are more flexible, but their implementations are still tied to a fixed set of common parallelism strategies, which makes it challenging to integrate state-of-the-art strategies. We present Piper, a user-controllable distributed training system that decouples the strategy from the runtime implementation. Piper allows users to declare a comprehensive distributed training strategy with a small set of model annotations and scheduling directives. Each directive applies a transformation to Piper's intermediate representation (IR), a unified global training DAG that represents all computation and communication. From this IR, Piper compiles per-device execution plans and executes them with a distributed runtime that is agnostic to the strategy. We show that the combined system maintains performance parity on commonly available strategies such as ZeRO, while also enabling additional performance and memory-efficiency gains through joint scheduling of compute and communication in composed parallelism strategies such as DeepSeek-V3's DualPipe.
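To make the pipeline from directives to per-device plans concrete, the following is a minimal toy sketch, not Piper's actual API. Every name in it (`Node`, `place`, `insert_transfers`, `compile_plans`, `build_example`) is hypothetical and chosen only for illustration. It assumes a directive can be modeled as a function that rewrites a global DAG of compute and communication nodes, and that a per-device plan is that device's slice of one global topological order.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Optional


@dataclass
class Node:
    # Hypothetical IR node; not Piper's real data structure.
    name: str
    kind: str                      # "compute" or "comm"
    device: Optional[int] = None
    deps: list = field(default_factory=list)


def place(dag, names, device):
    """Hypothetical directive: pin the named nodes to one device."""
    for n in names:
        dag[n].device = device
    return dag


def insert_transfers(dag):
    """Hypothetical directive: make each cross-device edge an explicit send/recv pair."""
    for node in list(dag.values()):          # snapshot, since we add nodes below
        new_deps = []
        for dep in node.deps:
            src = dag[dep]
            if src.device != node.device:
                send = Node(f"send_{dep}_to_{node.name}", "comm", src.device, [dep])
                recv = Node(f"recv_{dep}_at_{node.name}", "comm", node.device, [send.name])
                dag[send.name] = send
                dag[recv.name] = recv
                new_deps.append(recv.name)
            else:
                new_deps.append(dep)
        node.deps = new_deps
    return dag


def compile_plans(dag):
    """Split one global topological order into per-device op lists."""
    order = TopologicalSorter({n.name: n.deps for n in dag.values()}).static_order()
    plans = {}
    for name in order:
        plans.setdefault(dag[name].device, []).append(name)
    return plans


def build_example():
    # Two-stage forward/backward chain, one stage per device.
    dag = {
        "fwd0": Node("fwd0", "compute"),
        "fwd1": Node("fwd1", "compute", deps=["fwd0"]),
        "bwd1": Node("bwd1", "compute", deps=["fwd1"]),
        "bwd0": Node("bwd0", "compute", deps=["bwd1"]),
    }
    place(dag, ["fwd0", "bwd0"], device=0)
    place(dag, ["fwd1", "bwd1"], device=1)
    insert_transfers(dag)
    return compile_plans(dag)


if __name__ == "__main__":
    for device, ops in sorted(build_example().items()):
        print(device, ops)
```

In this sketch, because communication appears in the same graph as computation, a scheduling step could in principle reorder sends and receives relative to compute on each device. That is the kind of joint scheduling the abstract refers to, although the toy example performs no such reordering.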