Parameter-efficient fine-tuning (PEFT) of pre-trained language models (PLMs) has emerged as a highly successful approach: it trains only a small number of parameters without sacrificing performance, and it has become the de facto learning paradigm as PLMs grow in size. However, existing PEFT methods are not memory-efficient because, like full fine-tuning, they still require caching most of the intermediate activations for the gradient calculation. One effective way to reduce the activation memory is to use a reversible model, whose intermediate activations need not be cached because they can be recomputed. Nevertheless, converting a PLM into a reversible variant with PEFT is not straightforward, since a reversible model has a different architecture from currently released PLMs. In this paper, we first investigate what the key factor is for the success of existing PEFT methods, and find that it is essential to preserve the PLM's starting point when initializing a PEFT method. Building on this finding, we propose memory-efficient fine-tuning (MEFT), which inserts adapters into a PLM, preserving the PLM's starting point and making it reversible without additional pre-training. We evaluate MEFT on the GLUE benchmark and five question-answering tasks with various backbones: BERT, RoBERTa, BART, and OPT. MEFT reduces the activation memory by up to 84% relative to full fine-tuning, with a negligible number of trainable parameters. Moreover, MEFT achieves the same score on GLUE as full fine-tuning and a comparable score on the question-answering tasks.
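To make the two ideas in the abstract concrete, the following is a minimal sketch and not the MEFT implementation. It assumes a generic additive reversible coupling (in the style of reversible residual networks) and a bottleneck adapter whose up-projection is initialized to zero. The functions `f`, `g`, `rev_forward`, `rev_inverse`, and `adapter` are hypothetical names chosen for illustration, and the toy sub-networks stand in for real PLM layers. The sketch shows that the inputs of a reversible block can be recomputed from its outputs, so they need not be cached, and that a zero-initialized adapter leaves the hidden state unchanged at the start of training, which is one way to preserve the starting point.

```python
import math

def f(x):
    # Hypothetical stand-in for one sub-network (e.g. a frozen PLM layer).
    return [math.tanh(0.5 * v) for v in x]

def g(x):
    # Hypothetical stand-in for a second sub-network (e.g. an adapter).
    return [0.1 * v * v for v in x]

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def sub(a, b):
    return [u - v for u, v in zip(a, b)]

def rev_forward(x1, x2):
    # Additive coupling: each output depends on one input plus a function
    # of the other stream, which is what makes the block invertible.
    y1 = add(x1, f(x2))
    y2 = add(x2, g(y1))
    return y1, y2

def rev_inverse(y1, y2):
    # Recompute the inputs from the outputs instead of caching them.
    x2 = sub(y2, g(y1))
    x1 = sub(y1, f(x2))
    return x1, x2

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def adapter(h, w_down, w_up):
    # Bottleneck adapter with a residual connection:
    # h + W_up * relu(W_down * h).
    z = [max(0.0, v) for v in matvec(w_down, h)]
    return add(h, matvec(w_up, z))

if __name__ == "__main__":
    x1, x2 = [0.3, -1.2, 2.0], [1.5, 0.0, -0.7]
    y1, y2 = rev_forward(x1, x2)
    r1, r2 = rev_inverse(y1, y2)
    print("max reconstruction error:",
          max(abs(a - b) for a, b in zip(x1 + x2, r1 + r2)))

    h = [0.4, -0.9, 1.1]
    w_down = [[0.2, -0.1, 0.5], [0.3, 0.7, -0.4]]  # 2 x 3, arbitrary values
    w_up = [[0.0, 0.0] for _ in range(3)]          # 3 x 2, zero-initialized
    print("adapter is identity at init:", adapter(h, w_down, w_up) == h)
```

In this sketch the reconstruction is exact only up to floating-point rounding, and the adapter stops being the identity as soon as `w_up` receives a non-zero update.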