Training large language models requires jointly configuring two interdependent aspects of the system: the global batch size, which governs statistical efficiency, and the 3D parallelism strategy, which governs hardware throughput. Existing approaches make these decisions independently: optimization work adapts the batch size to track the evolving critical batch size while keeping parallelism fixed, and systems work selects the fastest parallelism for a given fixed batch size without anticipating that the optimal batch size could change. We show that these decisions are tightly coupled: the throughput-optimal parallelism strategy may shift as the global batch size changes, so any method that fixes one while adapting the other operates with a suboptimal configuration for part of the training run. We present COPUS, a system that adaptively tunes the global batch size, parallelism strategy, and micro-batch size as training evolves. COPUS is guided by Goodput, the product of throughput and statistical efficiency, which models both hardware and statistical effects jointly and directly measures useful convergence per unit of wall-clock time. The system combines online gradient noise scale estimation under 3D parallelism with throughput-aware evaluation of candidate configurations, and supports efficient reconfiguration of both batch size and parallelism during training. We evaluate COPUS on LLM pre-training workloads across 1-4 nodes of 8xH100 and 8xMI210 GPUs and model sizes from 3B to 32B parameters, demonstrating average time-to-convergence speedups of 3.9-8.0% over the fastest baseline across four configurations, with peak gains up to 11.1%, including system overheads.
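To make the coupling concrete, the following is a minimal sketch of Goodput-guided selection. It is not COPUS's implementation. The abstract defines Goodput only as the product of throughput and statistical efficiency, so the efficiency formula used here, `(noise_scale + ref_batch) / (noise_scale + global_batch)`, is an assumption borrowed from the gradient-noise-scale literature. The `Config` fields, the throughput numbers, and the candidate list are likewise hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One candidate training configuration (all values hypothetical)."""
    global_batch: int   # sequences per optimizer step
    tp: int             # tensor-parallel degree
    pp: int             # pipeline-parallel degree
    dp: int             # data-parallel degree
    micro_batch: int    # sequences per micro-batch
    throughput: float   # measured sequences per second for this layout


def statistical_efficiency(global_batch: int, noise_scale: float, ref_batch: int) -> float:
    """Progress per sample relative to training at ref_batch.

    Assumed model: efficiency falls once the batch size exceeds the
    gradient noise scale. This is an illustrative choice, not a formula
    taken from the paper.
    """
    return (noise_scale + ref_batch) / (noise_scale + global_batch)


def goodput(cfg: Config, noise_scale: float, ref_batch: int) -> float:
    """Goodput = throughput x statistical efficiency."""
    return cfg.throughput * statistical_efficiency(cfg.global_batch, noise_scale, ref_batch)


def best_config(candidates, noise_scale: float, ref_batch: int) -> Config:
    """Pick the candidate with the highest Goodput at the current noise scale."""
    return max(candidates, key=lambda c: goodput(c, noise_scale, ref_batch))


# Two hypothetical layouts on 8 GPUs: the larger batch runs faster per sample
# under a different parallelism strategy, but wastes samples while the
# gradient noise scale is still small.
CANDIDATES = [
    Config(global_batch=256, tp=4, pp=1, dp=2, micro_batch=4, throughput=100.0),
    Config(global_batch=1024, tp=2, pp=1, dp=4, micro_batch=8, throughput=160.0),
]

if __name__ == "__main__":
    for noise_scale in (128.0, 4096.0):
        chosen = best_config(CANDIDATES, noise_scale, ref_batch=256)
        print(f"noise_scale={noise_scale:g}: batch={chosen.global_batch}, "
              f"tp={chosen.tp}, pp={chosen.pp}, dp={chosen.dp}")
```

Under these assumed numbers, the small-batch layout wins while the noise scale is low. Once the noise scale grows, the large-batch layout wins, and the preferred parallelism degrees change with it. This is the kind of shift that a method fixing either the batch size or the parallelism strategy would miss.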