DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning

Prompt tuning (PT), where a small number of trainable soft (continuous) prompt vectors are affixed to the input of language models (LMs), has shown promising results across various tasks and models for parameter-efficient fine-tuning (PEFT). PT stands out from other PEFT approaches because it maintains competitive performance with fewer trainable parameters and does not drastically scale up its parameters as the model size expands. However, PT introduces additional soft prompt tokens, leading to longer input sequences, which significantly increases training and inference time and memory usage because of the Transformer's quadratic complexity in sequence length. This is particularly concerning for Large Language Models (LLMs) that face heavy daily querying. To address this issue, we propose Decomposed Prompt Tuning (DePT), which decomposes the soft prompt into a shorter soft prompt and a pair of low-rank matrices that are then optimised with two different learning rates. This allows DePT to achieve better performance while saving substantial memory and time costs compared to vanilla PT and its variants, without changing the number of trainable parameters. Through extensive experiments on 23 natural language processing (NLP) and vision-language (VL) tasks, we demonstrate that DePT, in some scenarios, outperforms state-of-the-art PEFT approaches as well as the full fine-tuning baseline. Additionally, we empirically show that DePT grows more efficient as the model size increases. Our further study reveals that DePT integrates seamlessly with parameter-efficient transfer learning in the few-shot learning setting and highlights its adaptability to various model architectures and sizes.
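The trade-off described above, a shorter prompt plus a low-rank pair at an unchanged parameter budget, can be illustrated with simple arithmetic. The sketch below is a minimal illustration and not the authors' implementation. The abstract does not say how the low-rank pair is used, so the code assumes its product is an additive update to the frozen input embeddings and therefore adds no tokens. The names `PromptBudget`, `vanilla_pt`, `decomposed_pt`, and `attention_cost_ratio`, along with every numeric value (hidden size 768, input length 256, prompt lengths 100 and 40, rank 45), are hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptBudget:
    """Trainable-parameter count and resulting input length for one setup."""
    trainable_params: int
    total_seq_len: int


def vanilla_pt(prompt_len: int, hidden: int, input_len: int) -> PromptBudget:
    # Vanilla PT trains one (prompt_len x hidden) matrix and prepends it
    # to the input, so the sequence grows by prompt_len tokens.
    return PromptBudget(prompt_len * hidden, prompt_len + input_len)


def decomposed_pt(short_len: int, rank: int, hidden: int, input_len: int) -> PromptBudget:
    # Assumed layout (not stated in the abstract): a (short_len x hidden)
    # prompt plus a low-rank pair A (input_len x rank) and B (rank x hidden).
    # The product A @ B is assumed to be added to the frozen input
    # embeddings, so only the short prompt lengthens the sequence.
    prompt_params = short_len * hidden
    low_rank_params = input_len * rank + rank * hidden
    return PromptBudget(prompt_params + low_rank_params, short_len + input_len)


def attention_cost_ratio(a: PromptBudget, b: PromptBudget) -> float:
    # Self-attention cost grows with the square of the sequence length.
    return (a.total_seq_len / b.total_seq_len) ** 2


# Illustrative values only; none of them come from the text above.
HIDDEN, INPUT_LEN = 768, 256
pt = vanilla_pt(prompt_len=100, hidden=HIDDEN, input_len=INPUT_LEN)
dept = decomposed_pt(short_len=40, rank=45, hidden=HIDDEN, input_len=INPUT_LEN)

print(pt.trainable_params, dept.trainable_params)  # same budget in both setups
print(pt.total_seq_len, dept.total_seq_len)        # shorter sequence when decomposed
print(round(attention_cost_ratio(dept, pt), 3))    # relative quadratic attention cost
```

With these illustrative values both setups train 76,800 parameters, while the decomposed variant feeds 296 tokens to the model instead of 356, which cuts the quadratic attention term to roughly 0.69 of the vanilla cost. The two learning rates mentioned above would apply to the short prompt and the low-rank pair as separate optimiser parameter groups; that step is omitted here to keep the sketch free of framework dependencies.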
