It is notoriously difficult to train Transformers on small datasets; typically, large pre-trained models are instead used as the starting point. We explore the weights of such pre-trained Transformers (particularly for vision) to attempt to find reasons for this discrepancy. Surprisingly, we find that simply initializing the weights of self-attention layers so that they "look" more like their pre-trained counterparts allows us to train vanilla Transformers faster and to higher final accuracies, particularly on vision tasks such as CIFAR-10 and ImageNet classification, where we see gains in accuracy of over 5% and 4%, respectively. Our initialization scheme is closed form, learning-free, and very simple: we set the product of the query and key weights to be approximately the identity, and the product of the value and projection weights to approximately the negative identity. As this mimics the patterns we saw in pre-trained Transformers, we call the technique "mimetic initialization".
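As an illustration of the two product constraints described above, the following is a minimal NumPy sketch, not the exact procedure of the method. It rests on assumptions that the text does not state: the target product is built as scaled Gaussian noise plus a signed, scaled identity, and it is then split into two factors with an SVD. The function names, the `alpha` and `beta` weights, and the `A @ B.T` product convention are all hypothetical choices made for this example.

```python
import numpy as np


def factor_target(target, k):
    """Split a (d, d) target into A, B of shape (d, k) so that A @ B.T is
    the best rank-k approximation of the target (exact when k == d)."""
    u, s, vt = np.linalg.svd(target)
    root = np.sqrt(s[:k])
    a = u[:, :k] * root      # scale the leading left singular vectors
    b = vt[:k, :].T * root   # scale the leading right singular vectors
    return a, b


def mimetic_init_sketch(d, k, alpha=0.7, beta=0.7, sign=1.0, seed=0):
    """Return two (d, k) factors whose product is roughly sign * beta * I.

    alpha and beta are illustrative values (assumptions), weighting the
    random part and the identity part of the target respectively.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, d))
    target = alpha * noise + sign * beta * np.eye(d)
    return factor_target(target, k)


d = 64
# Query/key product: approximately the (scaled) identity.
w_q, w_k = mimetic_init_sketch(d, k=d, sign=+1.0, seed=0)
# Value/projection product: approximately the (scaled) negative identity.
w_v, w_proj = mimetic_init_sketch(d, k=d, sign=-1.0, seed=1)

qk = w_q @ w_k.T
vp = w_v @ w_proj.T
print("mean diagonal of W_Q W_K^T:   ", np.trace(qk) / d)
print("mean diagonal of W_V W_proj^T:", np.trace(vp) / d)
```

In this sketch the query-key product has a clearly positive diagonal and the value-projection product a clearly negative one, which is the qualitative pattern the abstract describes. Real attention layers may store these matrices transposed or split across heads, so the factors would need to be reshaped to match a given implementation.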