Pipeline parallelism enables training models that exceed single-device memory, but practical throughput remains limited by pipeline bubbles. Although parameter freezing can improve training throughput by adaptively skipping backward computation, existing methods often over-freeze parameters, resulting in unnecessary accuracy degradation. To address this issue, we propose TimelyFreeze, which models the pipeline schedule as a directed acyclic graph and solves a linear program to compute optimal freeze ratios that minimize batch execution time under accuracy constraints. Experiments show that TimelyFreeze achieves up to 40% training throughput improvement on LLaMA-8B with comparable accuracy. Overall, it enables faster large-scale model training without compromising convergence and generalizes across diverse pipeline-parallel settings.
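The abstract's linear-program formulation can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's actual model: it assumes the time saved by freezing stage `s` is linear in its freeze ratio `f[s]`, and that accuracy degradation is a weighted sum of freeze ratios capped by a budget; all names and cost coefficients are invented for illustration.

```python
# Hypothetical sketch of an LP for per-stage freeze ratios
# (assumes a simplified linear cost model, not the paper's exact formulation).
import numpy as np
from scipy.optimize import linprog

def solve_freeze_ratios(back_time, acc_weight, acc_budget, f_max=0.8):
    """Choose freeze ratios f in [0, f_max] for each pipeline stage.

    back_time[s]  : backward-pass time of stage s (assumed saved in
                    proportion to f[s] when that fraction is frozen)
    acc_weight[s] : assumed accuracy cost per unit of freezing at stage s
    acc_budget    : total accuracy-degradation budget

    Maximizing total time saved, sum(back_time * f), is equivalent to
    minimizing batch execution time under this linear model, so we
    negate the objective for linprog's minimizer.
    """
    back_time = np.asarray(back_time, dtype=float)
    acc_weight = np.asarray(acc_weight, dtype=float)
    res = linprog(
        c=-back_time,              # maximize time saved
        A_ub=[acc_weight],         # accuracy constraint: sum(w * f) <= budget
        b_ub=[acc_budget],
        bounds=[(0.0, f_max)] * len(back_time),
        method="highs",
    )
    return res.x

# Example: 4 stages; earlier stages assumed cheaper to freeze accuracy-wise.
ratios = solve_freeze_ratios(
    back_time=[3.0, 2.5, 2.0, 1.5],
    acc_weight=[0.2, 0.4, 0.6, 1.0],
    acc_budget=0.5,
)
```

Under these made-up coefficients the LP naturally freezes more of the stages with high time savings per unit of accuracy cost, which mirrors the paper's stated goal of avoiding over-freezing while still shrinking pipeline bubbles.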