Efficient finetuning of vision-language models (VLMs) such as CLIP for specific downstream tasks has attracted significant attention. Previous works primarily focus on prompt learning to adapt CLIP to a variety of downstream tasks, but they suffer from task overfitting when finetuned on a small dataset. In this paper, we introduce an orthogonal finetuning method for efficiently updating pretrained weights, which enhances robustness and generalization, while a cross-regularization strategy is further exploited to maintain the stability of the zero-shot generalization of VLMs; we dub our method \textbf{\textit{OrthCR}}. Specifically, trainable orthogonal matrices are seamlessly injected into the transformer architecture and constrained to remain orthogonal via Cayley parameterization, benefiting from its norm-preserving property and thus yielding stable and faster convergence. To alleviate deviation from the orthogonality constraint during training, a cross-regularization strategy is further employed with the initial pretrained weights in a bypass manner. In addition, to enrich sample diversity for downstream tasks, we are the first to explore Cutout data augmentation for boosting efficient finetuning, and we analyze, from the perspective of orthogonality learning, how our approach improves task-specific downstream performance while maintaining generalizability. Going beyond existing prompt learning techniques, we conduct extensive experiments to demonstrate that our method explicitly steers the pretrained weight space to represent task-specific knowledge and achieves competitive generalizability under \textit{base-to-base/base-to-new}, \textit{cross-dataset transfer}, and \textit{domain generalization} evaluations.
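As a minimal illustration of the Cayley parameterization mentioned above (a standard construction, not the paper's actual implementation; the function name \texttt{cayley} is ours): any square matrix $W$ yields an orthogonal matrix $Q=(I-A)(I+A)^{-1}$ via its skew-symmetric part $A=\tfrac{1}{2}(W-W^\top)$, which is what makes the injected matrices norm-preserving by construction.

```python
import numpy as np

def cayley(W: np.ndarray) -> np.ndarray:
    """Map an arbitrary square matrix to an orthogonal one via the Cayley transform."""
    A = 0.5 * (W - W.T)               # skew-symmetric part: A.T == -A
    I = np.eye(W.shape[0])
    # I + A is always invertible for skew-symmetric A (its eigenvalues are 1 + i*t)
    return (I - A) @ np.linalg.inv(I + A)

# Orthogonality check: Q.T @ Q == I, so ||Qx|| == ||x|| (norm preservation)
rng = np.random.default_rng(0)
Q = cayley(rng.standard_normal((4, 4)))
print(np.allclose(Q.T @ Q, np.eye(4)))  # True
```

Because $Q$ is orthogonal for any unconstrained $W$, gradient updates on $W$ never leave the orthogonal manifold, which is the property the abstract's stability argument relies on.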