Scaling of deep neural networks, especially Transformers, is pivotal for their surging performance and has further led to the emergence of sophisticated reasoning capabilities in foundation models. Such scaling generally requires training large models from scratch with random initialization, which fails to leverage the knowledge acquired by their smaller counterparts, themselves already resource-intensive to obtain. To tackle this inefficiency, we present $\textbf{L}$ossl$\textbf{E}$ss $\textbf{MO}$del Expansio$\textbf{N}$ (LEMON), a recipe to initialize scaled models using the weights of their smaller but pre-trained counterparts. The expanded model is then trained with an optimized learning rate scheduler tailored explicitly for scaled models, which substantially reduces training time compared to training from scratch. Notably, LEMON is versatile and compatible with various network structures, including Vision Transformers and BERT. Our empirical results demonstrate that LEMON reduces computational costs by 56.7% for Vision Transformers and 33.2% for BERT when compared to training from scratch.
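To make the idea of "lossless" expansion concrete, the following is a minimal sketch of function-preserving width expansion for a two-layer ReLU MLP. It is not LEMON's actual procedure, which the abstract does not spell out; it illustrates only the generic neuron-duplication idea, in which copied hidden units have their outgoing weights divided by the number of copies so that the wider network computes the same output. The names `mlp` and `widen`, the layer sizes, and the random mapping are all assumptions made for illustration.

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    """Two-layer MLP: y = W2 @ relu(W1 @ x + b1) + b2."""
    h = np.maximum(W1 @ x + b1, 0.0)
    return W2 @ h + b2

def widen(W1, b1, W2, new_width, rng):
    """Expand the hidden layer to `new_width` units without changing the function.

    Hypothetical helper: every new unit is a copy of a randomly chosen old unit,
    and each unit's outgoing weights are divided by how many copies of it exist.
    """
    old_width = W1.shape[0]
    assert new_width >= old_width
    # The first `old_width` units map to themselves; the extra units copy random old units.
    extra = rng.integers(0, old_width, size=new_width - old_width)
    mapping = np.concatenate([np.arange(old_width), extra])
    counts = np.bincount(mapping, minlength=old_width)
    W1_new = W1[mapping]                       # duplicate incoming weights (rows)
    b1_new = b1[mapping]                       # duplicate biases
    W2_new = W2[:, mapping] / counts[mapping]  # split outgoing weights across copies
    return W1_new, b1_new, W2_new

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

W1w, b1w, W2w = widen(W1, b1, W2, new_width=7, rng=rng)
small_out = mlp(x, W1, b1, W2, b2)
wide_out = mlp(x, W1w, b1w, W2w, b2)
print(np.allclose(small_out, wide_out))
```

Because the wider network starts from exactly the same function as the small one, training can resume from the small model's loss rather than from a random initialization. Extending this to full Transformer blocks (attention heads, normalization layers, added depth) requires additional care that this sketch does not cover.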