Batch size scheduling (BSS) plays a critical role in large-scale deep learning training, influencing both optimization dynamics and computational efficiency. Yet, its theoretical foundations remain poorly understood. In this work, we show that the functional scaling law (FSL) framework introduced in Li et al. (2025a) provides a principled lens for analyzing BSS. Specifically, we characterize the optimal BSS under a fixed data budget and show that its structure depends sharply on task difficulty. For easy tasks, optimal schedules keep increasing batch size throughout. In contrast, for hard tasks, the optimal schedule maintains small batch sizes for most of training and switches to large batches only in a late stage. To explain the emergence of late switching, we uncover a dynamical mechanism -- the fast catch-up effect -- which also manifests in large language model (LLM) pretraining. After switching from small to large batches, the loss rapidly aligns with the constant large-batch trajectory. Using FSL, we show that this effect stems from rapid forgetting of accumulated gradient noise, with the catch-up speed determined by task difficulty. Crucially, this effect implies that large batches can be safely deferred to late training without sacrificing performance, while substantially reducing data consumption. Finally, extensive LLM pretraining experiments -- covering both Dense and MoE architectures with up to 1.1B parameters and 1T tokens -- validate our theoretical predictions. Across all settings, late-switch schedules consistently outperform constant-batch and early-switch baselines.
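To make the schedule concrete, here is a minimal sketch of a late-switch batch size schedule of the kind the abstract describes: a small batch is kept for most of training, then the schedule switches to a large batch in a late stage. All names and values (`small_batch`, `large_batch`, `switch_frac`) are hypothetical illustrations, not the paper's actual hyperparameters.

```python
def late_switch_schedule(step: int, total_steps: int,
                         small_batch: int = 256, large_batch: int = 4096,
                         switch_frac: float = 0.8) -> int:
    """Return the batch size to use at a given training step.

    Keeps the small batch for the first `switch_frac` fraction of
    training, then switches to the large batch for the late stage.
    The specific values here are illustrative placeholders.
    """
    if step < switch_frac * total_steps:
        return small_batch
    return large_batch
```

In a training loop one would query this function each step (or each data-loader rebuild) to decide how many samples to accumulate before an optimizer update; an early-switch baseline corresponds to a small `switch_frac`, and a constant-batch baseline to `switch_frac` of 0 or 1.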