Training large transformer models from scratch for a target task requires large amounts of data and is computationally demanding. The usual practice of transfer learning overcomes this challenge by initializing the model with the weights of a pretrained model of the same size and specification, which speeds up convergence and training. However, what if no pretrained model of the required size is available? In this paper, we introduce a simple yet effective technique to transfer the knowledge of a pretrained model to smaller variants. Our approach, called weight subcloning, expedites the training of scaled-down transformers by initializing their weights from larger pretrained models. Weight subcloning operates directly on the pretrained model to obtain an equivalent initialized scaled-down model. It consists of two key steps. First, we introduce neuron importance ranking to decrease the embedding dimension per layer in the pretrained model. Then, we remove blocks from the transformer model to match the number of layers in the scaled-down network. The result is a network ready to undergo training that trains significantly faster than a randomly initialized one. For instance, we achieve 4x faster training for vision transformers in image classification and for language models designed for next-token prediction.
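The two steps can be illustrated with a toy sketch in plain Python. It is not the paper's implementation, and several details are assumptions made for illustration only: each block is reduced to a single square weight matrix over the embedding dimension, the importance score is the summed absolute weight magnitude touching each embedding index, and the retained blocks are chosen at evenly spaced depths. The names `neuron_importance` and `subclone` are hypothetical.

```python
def neuron_importance(blocks):
    """Score each embedding index across all blocks.

    Assumption: importance is the total absolute weight magnitude in the
    row and column of that index, summed over blocks. The abstract does
    not state the scoring rule, so this is only a plausible stand-in.
    """
    dim = len(blocks[0])
    scores = [0.0] * dim
    for w in blocks:  # w is a dim x dim matrix stored as a list of rows
        for i in range(dim):
            row_sum = sum(abs(x) for x in w[i])
            col_sum = sum(abs(w[r][i]) for r in range(dim))
            scores[i] += row_sum + col_sum
    return scores


def subclone(blocks, target_dim, target_layers):
    """Return a narrower, shallower copy of `blocks` for initialization.

    Step 1: keep the `target_dim` highest-scoring embedding indices, using
            the same indices in every block so residual paths stay aligned.
    Step 2: keep `target_layers` blocks at evenly spaced depths
            (an assumption; the abstract only says blocks are removed).
    """
    n_layers = len(blocks)
    dim = len(blocks[0])
    if not (1 <= target_dim <= dim and 1 <= target_layers <= n_layers):
        raise ValueError("target sizes must not exceed the pretrained sizes")

    scores = neuron_importance(blocks)
    ranked = sorted(range(dim), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:target_dim])  # restore the original index order

    # Slice every block down to the kept rows and columns.
    shrunk = [[[w[i][j] for j in keep] for i in keep] for w in blocks]

    # Pick evenly spaced block indices, always including the first block.
    if target_layers == 1:
        layer_idx = [0]
    else:
        layer_idx = [k * (n_layers - 1) // (target_layers - 1)
                     for k in range(target_layers)]

    return [shrunk[i] for i in layer_idx], keep, layer_idx
```

Calling `subclone(blocks, target_dim, target_layers)` on a list of pretrained matrices returns the reduced matrices together with the kept embedding indices and block indices. The reduced matrices would then serve as the starting point for ordinary training in place of random initialization.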