Pipeline parallelism enables training models that exceed single-device memory, but practical throughput remains limited by pipeline bubbles. Although parameter freezing can improve training throughput by adaptively skipping backward computation, existing methods often over-freeze parameters, causing unnecessary accuracy degradation. To address this issue, we propose TimelyFreeze, which models the pipeline schedule as a directed acyclic graph and solves a linear program to compute the freeze ratios that minimize batch execution time under accuracy constraints. Experiments show that TimelyFreeze achieves up to a 40% improvement in training throughput on LLaMA-8B with comparable accuracy. Overall, it enables faster large-scale model training without compromising convergence and generalizes across diverse pipeline-parallel settings.
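The abstract's linear program can be illustrated with a minimal sketch. Assuming hypothetical per-stage backward-pass times and per-stage accuracy-impact weights (all numbers below are illustrative, not from the paper), choosing freeze ratios that maximize skipped backward time under a total accuracy-degradation budget is a small LP solvable with `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical inputs (illustrative only, not from the paper):
# t[i] = backward-pass time saved per unit freeze ratio at stage i
# a[i] = accuracy cost incurred per unit freeze ratio at stage i
t = np.array([1.2, 1.0, 0.9, 1.1])
a = np.array([0.8, 0.5, 0.3, 0.2])
budget = 0.6  # total accuracy-degradation budget

# Maximize total time saved sum(t[i] * r[i])  <=>  minimize -t @ r,
# subject to a @ r <= budget and 0 <= r[i] <= 1 for each stage.
res = linprog(c=-t,
              A_ub=a.reshape(1, -1), b_ub=[budget],
              bounds=[(0.0, 1.0)] * len(t))
ratios = res.x  # optimal per-stage freeze ratios
```

This is a fractional-knapsack-shaped LP, so the solver fully freezes the stages with the best time-saved-to-accuracy-cost ratio first and spends any leftover budget fractionally; the paper's actual formulation additionally encodes the pipeline schedule as a DAG, which this sketch omits.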