The rapid progress of Transformers in artificial intelligence has come at the cost of increased resource consumption and greenhouse gas emissions due to growing model sizes. Prior work suggests using pretrained small models to improve training efficiency, but this approach may not be suitable for new model structures. On the other hand, training from scratch can be slow, and progressively stacking layers often fails to achieve significant acceleration. To address these challenges, we propose a novel method called Apollo, which prep\textbf{a}res lessons for ex\textbf{p}anding \textbf{o}perations by \textbf{l}earning high-\textbf{l}ayer functi\textbf{o}nality during training of low layers. Our approach involves low-value-prioritized sampling (LVPS) to train different depths and weight sharing to facilitate efficient expansion. We also introduce an interpolation method for stable model depth extension. Experiments demonstrate that Apollo achieves state-of-the-art acceleration ratios, even rivaling methods using pretrained models, making it a universal and efficient solution for training deep models while reducing time, financial, and environmental costs.
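The abstract's low-value-prioritized sampling (LVPS) is only named, not specified, so the following is a purely hypothetical sketch: we assume "low-value-prioritized" means that shallower sub-model depths are drawn with higher probability during training, with the weighting function left as a free parameter. The function name `lvps_sample` and the inverse-depth weighting are illustrative assumptions, not the paper's actual procedure.

```python
import random

def lvps_sample(depths, weight=lambda d: 1.0 / d):
    """Sample a training depth; the default weight (assumed, not from the
    paper) favors lower depths via an inverse-depth prior."""
    weights = [weight(d) for d in depths]
    return random.choices(depths, weights=weights, k=1)[0]

# Example: a 12-layer target model trained at candidate sub-depths.
depth = lvps_sample([3, 6, 9, 12])
```

Under this sketch, lower layers are trained most often while deeper configurations are still visited, which is one plausible way to "prepare lessons" for later depth expansion.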