Large-scale vision-language (V-L) models have demonstrated remarkable generalization to downstream tasks through prompt tuning. However, their performance degrades significantly under class imbalance, a common issue in real-world scenarios. In this paper, we investigate the effect of class imbalance on the generalization performance of V-L models and extend the Neural Collapse phenomenon to these models, revealing the geometric reasons why class imbalance harms their generalization ability. To address this problem, we propose Neural Collapse based Prompt Tuning (NPT), a novel method that optimizes prompts so that both text and image features satisfy the same simplex equiangular tight frame (ETF) structure. NPT incorporates two regularization terms, geometric de-biasing and multi-modal isomorphism, to enhance the robustness of V-L models under class imbalance while maintaining their generalization capabilities. Comprehensive experiments show that NPT outperforms existing prompt learning techniques across 11 diverse image recognition datasets, achieving absolute average gains of 2.63\% on novel classes and 2.47\% on the harmonic mean under imbalanced data.
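For readers unfamiliar with the simplex ETF structure named above, the following is a minimal sketch of the standard construction from the Neural Collapse literature, not the NPT method itself. The function name `simplex_etf`, its parameters, and the use of NumPy are assumptions made for illustration; the abstract does not specify how the target structure is built or how the two regularization terms are computed.

```python
import numpy as np


def simplex_etf(num_classes: int, dim: int, seed: int = 0) -> np.ndarray:
    """Return a (dim, num_classes) matrix whose columns form a simplex ETF.

    Hypothetical helper for illustration only. This construction
    assumes dim >= num_classes.
    """
    if num_classes < 2 or dim < num_classes:
        raise ValueError("need num_classes >= 2 and dim >= num_classes")
    rng = np.random.default_rng(seed)
    # Reduced QR of a random Gaussian matrix gives U (dim x K) with U^T U = I.
    u, _ = np.linalg.qr(rng.standard_normal((dim, num_classes)))
    k = num_classes
    # Centering matrix I - (1/K) 11^T removes the global mean direction.
    centering = np.eye(k) - np.ones((k, k)) / k
    # The scale sqrt(K / (K - 1)) makes every column unit-norm.
    return np.sqrt(k / (k - 1)) * (u @ centering)


etf = simplex_etf(num_classes=5, dim=16)
gram = etf.T @ etf  # pairwise cosine similarities between class vectors
```

In this sketch every column has unit norm and every pair of distinct columns has cosine similarity -1/(K-1), the maximally separated equiangular configuration for K classes. A method that asks text and image features to share one such structure would compare class-level features against these K directions.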