Make Your Pre-trained Model Reversible: From Parameter to Memory Efficient Fine-Tuning

Parameter-efficient fine-tuning (PEFT) of pre-trained language models (PLMs) has emerged as a highly successful approach: it trains only a small number of parameters without sacrificing performance, and it has become the de facto learning paradigm as PLMs grow in size. However, existing PEFT methods are not memory-efficient, because, like full fine-tuning, they still require caching most of the intermediate activations for gradient calculation. One effective way to reduce activation memory is to use a reversible model, in which the intermediate activations need not be cached and can instead be recomputed. Nevertheless, converting a PLM into its reversible variant with PEFT is not straightforward, since a reversible model has a different architecture from currently released PLMs. In this paper, we first investigate which factor is key to the success of existing PEFT methods, and find that it is essential to preserve the PLM's starting point when initializing a PEFT method. Building on this finding, we propose memory-efficient fine-tuning (MEFT), which inserts adapters into a PLM, preserving the PLM's starting point and making it reversible without additional pre-training. We evaluate MEFT on the GLUE benchmark and five question-answering tasks with various backbones: BERT, RoBERTa, BART, and OPT. MEFT reduces activation memory by up to 84% relative to full fine-tuning while training a negligible number of parameters. Moreover, MEFT achieves the same score on GLUE, and a comparable score on the question-answering tasks, as full fine-tuning.
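To make the idea of a reversible model concrete, the following is a minimal sketch of a generic two-stream additive coupling layer, the usual construction behind reversible networks. It is not MEFT's architecture: the sub-functions `f` and `g`, the toy inputs, and all helper names are hypothetical stand-ins chosen for illustration. The point it shows is that the layer's inputs can be reconstructed from its outputs alone, so they would not have to be cached for the backward pass.

```python
import math

def f(x):
    # Hypothetical sub-function F (stand-in for, e.g., an attention block).
    return [math.tanh(0.5 * v) for v in x]

def g(x):
    # Hypothetical sub-function G (stand-in for, e.g., a feed-forward block).
    return [0.1 * v * v for v in x]

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def sub(a, b):
    return [u - v for u, v in zip(a, b)]

def forward(x1, x2):
    # Additive coupling: each stream is updated using only the other stream.
    y1 = add(x1, f(x2))
    y2 = add(x2, g(y1))
    return y1, y2

def inverse(y1, y2):
    # Undo the two updates in reverse order to recover the inputs.
    x2 = sub(y2, g(y1))
    x1 = sub(y1, f(x2))
    return x1, x2

# Toy inputs (assumed values, for illustration only).
x1, x2 = [1.0, -2.0], [0.5, 3.0]
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)

# Largest reconstruction error across both streams; it should be at the
# level of floating-point rounding.
max_err = max(abs(a - b) for a, b in zip(r1 + r2, x1 + x2))
print(max_err)
```

Neither `f` nor `g` needs to be invertible for this to work, since the inverse only subtracts their outputs. The trade-off is one extra evaluation of `f` and `g` per layer during the backward pass in exchange for not storing the inputs.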
