Parameter-efficient fine-tuning approaches have recently attracted considerable attention. With considerably fewer trainable weights, these methods can improve scalability and computational efficiency. In this paper, we look for optimal sub-networks and investigate the capability of different transformer modules to transfer knowledge from a pre-trained model to a downstream task. Our empirical results suggest that every transformer module in BERT can act as a winning ticket: fine-tuning a single module while keeping the rest of the network frozen can yield performance comparable to full fine-tuning. Among the different modules, LayerNorms exhibit the best capacity for knowledge transfer with limited trainable weights, to the extent that, with only 0.003% of all parameters in the layer-wise analysis, they show acceptable performance on various target tasks. As for why they are effective, we argue that their notable performance could be attributed to their weights, which have high magnitude compared to those of the other modules in the pre-trained BERT.
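To make the 0.003% figure concrete, the following is a minimal sketch that counts parameters by name and selects only the LayerNorm weights as trainable. It rests on assumptions not stated in the abstract: the BERT-base dimensions (12 layers, hidden size 768, vocabulary 30522) and Hugging Face-style parameter names. Under these assumptions, the two LayerNorms of a single encoder layer account for roughly 0.003% of all parameters, which is one plausible reading of the "layer-wise analysis"; it is not necessarily how the authors computed the number.

```python
from math import prod

# Assumed BERT-base dimensions (not given in the abstract).
HIDDEN, FFN, LAYERS = 768, 3072, 12
VOCAB, MAX_POS, TYPES = 30522, 512, 2


def bert_base_param_shapes():
    """Return {parameter name: shape} for a BERT-base-like encoder.

    Names follow the Hugging Face convention; this is an assumption
    made for illustration only.
    """
    shapes = {
        "embeddings.word_embeddings.weight": (VOCAB, HIDDEN),
        "embeddings.position_embeddings.weight": (MAX_POS, HIDDEN),
        "embeddings.token_type_embeddings.weight": (TYPES, HIDDEN),
        "embeddings.LayerNorm.weight": (HIDDEN,),
        "embeddings.LayerNorm.bias": (HIDDEN,),
        "pooler.dense.weight": (HIDDEN, HIDDEN),
        "pooler.dense.bias": (HIDDEN,),
    }
    # Linear sub-modules of one encoder layer as (output dim, input dim).
    linear = {
        "attention.self.query": (HIDDEN, HIDDEN),
        "attention.self.key": (HIDDEN, HIDDEN),
        "attention.self.value": (HIDDEN, HIDDEN),
        "attention.output.dense": (HIDDEN, HIDDEN),
        "intermediate.dense": (FFN, HIDDEN),
        "output.dense": (HIDDEN, FFN),
    }
    for i in range(LAYERS):
        prefix = f"encoder.layer.{i}."
        for name, (out_dim, in_dim) in linear.items():
            shapes[prefix + name + ".weight"] = (out_dim, in_dim)
            shapes[prefix + name + ".bias"] = (out_dim,)
        for name in ("attention.output.LayerNorm", "output.LayerNorm"):
            shapes[prefix + name + ".weight"] = (HIDDEN,)
            shapes[prefix + name + ".bias"] = (HIDDEN,)
    return shapes


def count(shapes, keep=lambda name: True):
    """Sum the number of scalars in every parameter whose name passes `keep`."""
    return sum(prod(shape) for name, shape in shapes.items() if keep(name))


shapes = bert_base_param_shapes()
total = count(shapes)

# "Trainable" sets: every LayerNorm, or only those of one encoder layer.
all_layernorm = count(shapes, lambda n: "LayerNorm" in n)
one_layer_layernorm = count(
    shapes, lambda n: n.startswith("encoder.layer.0.") and "LayerNorm" in n
)

print(f"total parameters:          {total:,}")
print(f"all LayerNorms:            {all_layernorm:,} "
      f"({100 * all_layernorm / total:.4f}%)")
print(f"LayerNorms of one layer:   {one_layer_layernorm:,} "
      f"({100 * one_layer_layernorm / total:.4f}%)")
```

In an actual PyTorch fine-tuning run, the same name filter could be used to decide which parameters keep `requires_grad=True` while everything else is frozen.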