Parameter-efficient fine-tuning approaches have recently attracted considerable attention. Because they train far fewer weights, these methods offer better scalability and computational efficiency. In this paper, we search for optimal sub-networks and investigate how well different transformer modules transfer knowledge from a pre-trained model to a downstream task. Our empirical results suggest that every transformer module in BERT can act as a winning ticket: fine-tuning a single module while keeping the rest of the network frozen can yield performance comparable to full fine-tuning. Among the modules, LayerNorms exhibit the best capacity for knowledge transfer with limited trainable weights: in the layer-wise analysis, with only 0.003% of all parameters, they reach acceptable performance on various target tasks. As for why they are effective, we argue that their notable performance could be attributed to their high-magnitude weights compared to those of the other modules in the pre-trained BERT.
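To give a sense of scale for the 0.003% figure, the following minimal sketch counts parameters for an encoder with the standard BERT-base shape. The abstract does not say which BERT variant is used, so the dimensions below are an assumption, as is reading "layer-wise" as "the two LayerNorms of a single encoder layer". The count covers the encoder and pooler only and leaves out any task-specific classification head.

```python
# Parameter budget of LayerNorm in a BERT-base-sized encoder.
# Assumption: standard BERT-base dimensions (not stated in the abstract).
HIDDEN = 768
INTERMEDIATE = 3072
NUM_LAYERS = 12
VOCAB_SIZE = 30522
MAX_POSITIONS = 512
TYPE_VOCAB = 2


def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(dim):
    """Scale and shift vectors of one LayerNorm."""
    return 2 * dim


def embedding_params():
    """Word, position and token-type tables plus the embedding LayerNorm."""
    tables = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN
    return tables + layer_norm(HIDDEN)


def encoder_layer_params():
    """One transformer layer: self-attention, feed-forward, two LayerNorms."""
    attention = 4 * linear(HIDDEN, HIDDEN)  # query, key, value, output projection
    feed_forward = linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN)
    return attention + feed_forward + 2 * layer_norm(HIDDEN)


def total_params():
    """Embeddings, all encoder layers, and the pooler."""
    pooler = linear(HIDDEN, HIDDEN)
    return embedding_params() + NUM_LAYERS * encoder_layer_params() + pooler


def percent(part, whole):
    return 100.0 * part / whole


total = total_params()
ln_one_layer = 2 * layer_norm(HIDDEN)  # both LayerNorms of one encoder layer
ln_all = NUM_LAYERS * ln_one_layer + layer_norm(HIDDEN)  # plus the embedding LayerNorm

print(f"total parameters:             {total:,}")
print(f"LayerNorm, one encoder layer: {ln_one_layer:,} ({percent(ln_one_layer, total):.4f}%)")
print(f"LayerNorm, whole model:       {ln_all:,} ({percent(ln_all, total):.4f}%)")
```

Under these assumptions, the LayerNorms of a single encoder layer account for 3,072 of roughly 109.5 million parameters, about 0.0028%, which is in line with the 0.003% quoted above. All LayerNorms in the model together come to about 0.035%.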