We present Generative Interpretable Fine-Tuning (GIFT) for parameter-efficient fine-tuning of pretrained Transformer backbones, which can be formulated as a simple factorized matrix multiplication in the parameter space or equivalently in the activation space, and thus embraces built-in interpretability. For a pretrained layer with weights $\omega\in \mathbb{R}^{d_{out}\times d_{in}}$, our proposed GIFT learns the fine-tuned weights $\hat{\omega}$ directly from $\omega$ as $\hat{\omega}=\omega \cdot (\mathbb{I}+\phi_{d_{in}\times r}\cdot \psi_{r\times d_{in}})$ where $\mathbb{I}$ is an identity matrix. $\Theta=(\phi, \psi)$ are the learnable parameters of the two linear layers of GIFT with $r$ being a hyper-parameter. $\Theta$ is shared by all the layers selected for fine-tuning, resulting in significantly fewer trainable parameters compared to Low-Rank Adaptation (LoRA). We perform comprehensive evaluations on natural language tasks (commonsense reasoning and sequence classification) and computer vision tasks (visual fine-grained classification). We obtain the best accuracy and parameter efficiency among baselines both on the Commonsense170k reasoning benchmark using LLaMA-1 (7B) and Llama-2 (7B)/-3 (8B) and on the FGVC and VTAB visual recognition benchmarks using ImageNet-21k pretrained Vision Transformer (ViT-B/16). Notably, we obtain 5.9% absolute increase in average accuracy with 53.8 times reduction of parameters on Commonsense170k using Llama-3 (8B) compared to LoRA. We obtain performance comparable to LoRA on the GLUE benchmark but with significantly fewer parameters using RoBERTa-Base/Large. We show the output of the first linear layer (i.e., $\omega\cdot \phi$) is surprisingly interpretable, which can play the role of a token-clustering head as a by-product to localize meaningful objects/parts in images for computer vision tasks. Our code is publicly available.
翻译:我们提出生成式可解释微调(GIFT),用于预训练Transformer骨干网络的参数高效微调。该方法可表述为参数空间或等效激活空间中的简单因子化矩阵乘法,从而具备内置可解释性。对于权重为$\omega\in \mathbb{R}^{d_{out}\times d_{in}}$的预训练层,我们提出的GIFT直接通过$\hat{\omega}=\omega \cdot (\mathbb{I}+\phi_{d_{in}\times r}\cdot \psi_{r\times d_{in}})$从$\omega$学习微调权重,其中$\mathbb{I}$为单位矩阵。$\Theta=(\phi, \psi)$是GIFT两个线性层的可学习参数,$r$为超参数。$\Theta$在所有选定微调的层间共享,相比低秩自适应(LoRA)显著减少了可训练参数量。我们在自然语言任务(常识推理与序列分类)和计算机视觉任务(视觉细粒度分类)上进行了全面评估。使用LLaMA-1(7B)及Llama-2(7B)/(8B)在Commonsense170k推理基准测试中,以及使用ImageNet-21k预训练的Vision Transformer(ViT-B/16)在FGVC与VTAB视觉识别基准测试中,我们均获得了基线方法中最优的准确率与参数效率。值得注意的是,在Commonsense170k数据集上使用Llama-3(8B)时,相比LoRA实现了平均准确率5.9%的绝对提升,同时参数量减少53.8倍。使用RoBERTa-Base/Large在GLUE基准测试中获得了与LoRA相当的性能,但参数量显著减少。我们发现首个线性层输出(即$\omega\cdot \phi$)具有显著的可解释性,可作为副产物扮演词元聚类头部的角色,在计算机视觉任务中定位图像中的有意义物体/部件。我们的代码已公开。