Fast Catch-Up, Late Switching: Optimal Batch Size Scheduling via Functional Scaling Laws

Batch size scheduling (BSS) plays a critical role in large-scale deep learning training, influencing both optimization dynamics and computational efficiency. Yet, its theoretical foundations remain poorly understood. In this work, we show that the functional scaling law (FSL) framework introduced in Li et al. (2025a) provides a principled lens for analyzing BSS. Specifically, we characterize the optimal BSS under a fixed data budget and show that its structure depends sharply on task difficulty. For easy tasks, optimal schedules keep increasing batch size throughout. In contrast, for hard tasks, the optimal schedule maintains small batch sizes for most of training and switches to large batches only in a late stage. To explain the emergence of late switching, we uncover a dynamical mechanism -- the fast catch-up effect -- which also manifests in large language model (LLM) pretraining. After switching from small to large batches, the loss rapidly aligns with the constant large-batch trajectory. Using FSL, we show that this effect stems from rapid forgetting of accumulated gradient noise, with the catch-up speed determined by task difficulty. Crucially, this effect implies that large batches can be safely deferred to late training without sacrificing performance, while substantially reducing data consumption. Finally, extensive LLM pretraining experiments -- covering both Dense and MoE architectures with up to 1.1B parameters and 1T tokens -- validate our theoretical predictions. Across all settings, late-switch schedules consistently outperform constant-batch and early-switch baselines.
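As a rough illustration of what a late-switch schedule looks like under a fixed data budget, the following minimal sketch splits a sample budget into a small-batch stage followed by a large-batch stage. The function names, the two-stage form, and the `switch_fraction` parameter are hypothetical choices made for this example; they are not the paper's optimal schedule, which is derived from the FSL framework.

```python
def late_switch_schedule(total_samples, small_batch, large_batch, switch_fraction):
    """Return [(batch_size, num_steps), ...] for a two-stage schedule.

    Hypothetical helper: `switch_fraction` is the share of the data budget
    spent at the small batch size before switching to the large one.
    """
    if not 0.0 <= switch_fraction <= 1.0:
        raise ValueError("switch_fraction must lie in [0, 1]")
    if small_batch <= 0 or large_batch <= 0:
        raise ValueError("batch sizes must be positive")

    # Samples allotted to the small-batch stage, rounded down to whole steps.
    small_steps = int(total_samples * switch_fraction) // small_batch
    # Whatever budget is left goes to the large-batch stage.
    remaining = total_samples - small_steps * small_batch
    large_steps = remaining // large_batch
    return [(small_batch, small_steps), (large_batch, large_steps)]


def total_steps(schedule):
    """Number of optimizer steps taken by a schedule."""
    return sum(steps for _, steps in schedule)


def samples_used(schedule):
    """Number of training samples consumed by a schedule."""
    return sum(batch * steps for batch, steps in schedule)


# Same data budget, three schedules: constant large, early switch, late switch.
budget = 1_000_000
constant_large = late_switch_schedule(budget, 100, 1000, 0.0)
early_switch = late_switch_schedule(budget, 100, 1000, 0.25)
late_switch = late_switch_schedule(budget, 100, 1000, 0.75)
```

With the budget held fixed, moving the switch later only changes how the samples are divided into steps: the late-switch schedule takes many more small-batch steps before the large-batch stage begins. The sketch describes the shape of the schedule only and makes no claim about the resulting loss.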
