A common strategy for Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViTs) adapts the model to downstream tasks by learning a low-rank adaptation matrix. This matrix is decomposed into a product of down-projection and up-projection matrices, with the bottleneck dimensionality being crucial for reducing the number of learnable parameters, as exemplified by prevalent methods such as LoRA and Adapter. However, these low-rank strategies typically employ a fixed bottleneck dimensionality, which limits their flexibility in handling layer-wise variations. To address this limitation, we propose a novel PEFT approach inspired by Singular Value Decomposition (SVD) for representing the adaptation matrix. SVD decomposes a matrix into the product of a left unitary matrix, a diagonal matrix of scaling values, and a right unitary matrix. We utilize Householder transformations to construct orthogonal matrices that efficiently mimic the unitary matrices, each requiring only a single vector. The diagonal values are learned in a layer-wise manner, allowing them to flexibly capture the unique properties of each layer. This approach enables the generation of adaptation matrices with varying ranks across different layers, providing greater flexibility in adapting pre-trained models. Experiments on standard downstream vision tasks demonstrate that our method achieves promising fine-tuning performance.
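The construction described above can be sketched as follows: a Householder reflection built from a single vector is orthogonal, so two such reflections can stand in for the left and right unitary factors of an SVD, with a learnable diagonal of scaling values between them. This is a minimal illustrative sketch, not the paper's implementation; the function names, dimensions, and the example diagonal are assumptions.

```python
import numpy as np

def householder(v):
    """Orthogonal Householder reflection H = I - 2 v v^T / ||v||^2,
    parameterized by a single vector v."""
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - 2.0 * np.outer(v, v)

def adaptation_matrix(u, s, v):
    """SVD-style adaptation Delta W = H_u @ diag(s) @ H_v.
    The number of nonzero entries in s sets the effective rank,
    so each layer's learned diagonal can yield a different rank."""
    return householder(u) @ np.diag(s) @ householder(v)

rng = np.random.default_rng(0)
d = 8
u, v = rng.normal(size=d), rng.normal(size=d)
# Hypothetical layer-wise diagonal: two nonzero scaling values -> rank 2.
s = np.array([1.5, 0.7, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
dW = adaptation_matrix(u, s, v)
```

Because the two Householder factors are full-rank orthogonal matrices, the rank of `dW` equals the number of nonzero diagonal values, which is how varying ranks across layers arise without changing the parameter count (two vectors plus one diagonal per layer).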