CLIP-SVD: Efficient and Interpretable Vision-Language Adaptation via Singular Values

Vision-language models (VLMs) like CLIP have shown impressive zero-shot and few-shot learning capabilities across diverse applications. However, adapting these models to new fine-grained domains remains difficult because of their reliance on prompt engineering and the high cost of full model fine-tuning. Existing adaptation approaches rely on augmented components, such as prompt tokens and adapter modules, which can limit adaptation quality, destabilize the model, and compromise the rich knowledge learned during pretraining. In this work, we present CLIP-SVD, a multi-modal and parameter-efficient adaptation framework that applies Singular Value Fine-tuning (SVF) to CLIP, leveraging Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Specifically, we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model. This design improves adaptation performance while training only 0.04% of the model's total parameters, and it better preserves the model's generalization ability. CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, we use a natural language-based approach to analyze the effectiveness and dynamics of the CLIP adaptation, which makes CLIP-SVD interpretable. Overall, this work provides the first extensive empirical evaluation of SVD-based fine-tuning in the vision-language model setting. The code and biomedical corpus are publicly available at https://github.com/HealthX-Lab/CLIP-SVD.
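The core idea of fine-tuning only the singular values can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the random 8x6 matrix stands in for a single CLIP parameter matrix, the squared-error objective and the plain gradient-descent loop are assumptions chosen so that the gradient can be written by hand, and the helper names (`svd_factorize`, `compose`, `singular_value_grad`) are hypothetical. The sketch shows only that the bases U and V^T stay fixed while the vector of singular values is the sole trainable quantity.

```python
import numpy as np

def svd_factorize(weight):
    """Split a weight matrix into fixed bases (U, V^T) and singular values s."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return u, s, vt

def compose(u, s, vt):
    """Rebuild the weight matrix W = U diag(s) V^T."""
    return (u * s) @ vt

def loss(u, s, vt, x, target):
    """Toy objective (an assumption): half the mean squared error of x W^T."""
    residual = x @ compose(u, s, vt).T - target
    return 0.5 * np.sum(residual ** 2) / x.shape[0]

def singular_value_grad(u, s, vt, x, target):
    """Gradient of the toy loss with respect to the singular values only."""
    residual = x @ compose(u, s, vt).T - target   # shape (n, d_out)
    grad_w = residual.T @ x / x.shape[0]          # dL/dW, shape (d_out, d_in)
    # dL/ds_i = u_i^T (dL/dW) v_i, the diagonal of U^T (dL/dW) V
    return np.einsum("ji,jk,ik->i", u, grad_w, vt)

rng = np.random.default_rng(0)
weight = rng.standard_normal((8, 6))              # stand-in for a pretrained matrix
u, s, vt = svd_factorize(weight)

# Toy "new domain": targets produced by the same bases with rescaled singular values.
x = rng.standard_normal((32, 6))
target = x @ compose(u, 1.5 * s, vt).T

s_tuned = s.copy()                                # the only trainable parameters
initial_loss = loss(u, s_tuned, vt, x, target)
for _ in range(200):
    s_tuned -= 0.5 * singular_value_grad(u, s_tuned, vt, x, target)
final_loss = loss(u, s_tuned, vt, x, target)

# Fraction of this matrix's entries that is trainable: min(d_out, d_in) / (d_out * d_in).
trainable_fraction = s.size / weight.size
```

In this toy setting only 6 of the 48 matrix entries' worth of parameters are updated. The fraction for a real model depends on its layer shapes and on which matrices are decomposed, so this number is not meant to reproduce the 0.04% figure quoted above.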
