Efficient pretraining paradigms for Transformer-based models have attracted increasing interest. Several recent approaches save computation by using smaller models to initialize larger ones (e.g., stacking and fusion). In this work, we study the fundamental question of how to select the best growing strategy from a given pool of candidates. Prior work has focused extensively on loss- and/or function-preserving behavior at initialization, or simply on performance at the end of training. In contrast, we find that behavior at initialization can be a misleading predictor of final performance, and we present an alternative perspective based on early training dynamics, which we call "landscape-aware growing (LAG)". We extensively analyze how final performance correlates with performance in the initial steps of training and find that the optimal growing strategy can be predicted early, and more accurately than from behavior at initialization (i.e., with only a small "lag" after initialization). This perspective also motivates an adaptive strategy for gradual stacking.
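The selection idea described above can be illustrated with a minimal sketch. It assumes that each candidate growing strategy has a recorded loss curve, and it compares how well the ranking at initialization (step 0) and the ranking after a small lag agree with the final ranking, using Spearman rank correlation. The strategy names, the loss values, and the helper functions `spearman`, `rank_correlation_at`, and `select_strategy` are all hypothetical and chosen only for illustration; they are not the paper's implementation or results.

```python
from typing import Dict, List


def ranks(values: List[float]) -> List[float]:
    """Return 1-based ranks; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def rank_correlation_at(curves: Dict[str, List[float]], step: int) -> float:
    """Correlate the loss at `step` with the final loss across strategies."""
    names = sorted(curves)
    early = [curves[name][step] for name in names]
    final = [curves[name][-1] for name in names]
    return spearman(early, final)


def select_strategy(curves: Dict[str, List[float]], lag: int) -> str:
    """Pick the strategy with the lowest loss `lag` steps after growing."""
    return min(curves, key=lambda name: curves[name][lag])


# Synthetic loss curves (made up for illustration): index 0 is the loss at
# initialization of the grown model, the last index is the final loss.
curves = {
    "strategy_a": [2.0, 2.9, 2.8, 2.7],  # best at init, worst at the end
    "strategy_b": [3.5, 2.6, 2.3, 2.1],  # worst at init, best at the end
    "strategy_c": [3.0, 2.8, 2.6, 2.4],
}

if __name__ == "__main__":
    for step in (0, 1):
        print(step, select_strategy(curves, step), rank_correlation_at(curves, step))
```

In this toy setup, choosing by loss at initialization and choosing after one step of lag pick different strategies, which is the kind of discrepancy the abstract describes. In practice the lag and the loss curves would come from actual short training runs of each grown model.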