Large-scale vision-language (V-L) models have demonstrated remarkable generalization to downstream tasks through prompt tuning. However, the mechanisms behind the learned text representations remain unknown, which limits further generalization gains, especially under class imbalance. Recent work on the neural collapse (NC) phenomenon in vision-only models suggests that the optimal representation structure is the simplex equiangular tight frame (ETF), which paves the way for studying representations in V-L models. In this paper, we make the first attempt to use NC to examine the representations of V-L models under prompt tuning. We find that the NC optimality of text-to-image representations correlates positively with downstream generalizability, and that this correlation is more pronounced under class imbalance. To improve the representations, we propose Neural-collapse-anchored Prompt Tuning (NPT), a novel method that learns prompts such that the text and image representations satisfy the same simplex ETF. NPT incorporates two regularization terms, language-modality collapse and multi-modality isomorphism, and is compatible with other prompt tuning methods. Extensive experiments show that NPT consistently improves existing prompt tuning techniques across 11 datasets in both balanced and imbalanced settings.
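To make the simplex ETF structure mentioned above concrete, the following is a minimal NumPy sketch. It builds the standard K-class simplex ETF, M = sqrt(K/(K-1)) (I - (1/K) 1 1^T), whose vectors have unit norm and pairwise cosine similarity -1/(K-1), and it measures how far a set of class vectors is from that geometry. The function names `simplex_etf` and `etf_deviation`, and the choice of a Frobenius-norm gap on the centered cosine Gram matrix, are illustrative assumptions; they are not the metric or the regularization terms used by NPT.

```python
import numpy as np


def simplex_etf(num_classes: int) -> np.ndarray:
    """Return a (K, K) matrix whose rows form a simplex ETF.

    Rows have unit norm and every pair has cosine similarity -1 / (K - 1).
    """
    k = num_classes
    if k < 2:
        raise ValueError("a simplex ETF needs at least two classes")
    return np.sqrt(k / (k - 1)) * (np.eye(k) - np.ones((k, k)) / k)


def etf_deviation(class_vectors: np.ndarray) -> float:
    """Frobenius distance between the cosine Gram matrix and the ETF Gram matrix.

    class_vectors: (K, d) array, one row per class (for example, one text
    embedding per class name). This is a hypothetical diagnostic, not the
    paper's NC measure. Zero means the centered, normalized vectors form
    an exact simplex ETF.
    """
    # Center on the global mean, then normalize each class vector.
    centered = class_vectors - class_vectors.mean(axis=0, keepdims=True)
    normed = centered / np.linalg.norm(centered, axis=1, keepdims=True)
    gram = normed @ normed.T
    k = gram.shape[0]
    # Ideal Gram matrix: 1 on the diagonal, -1 / (K - 1) elsewhere.
    target = (k * np.eye(k) - np.ones((k, k))) / (k - 1)
    return float(np.linalg.norm(gram - target))


if __name__ == "__main__":
    etf = simplex_etf(5)
    print("deviation of exact ETF:", etf_deviation(etf))
    rng = np.random.default_rng(0)
    print("deviation of random vectors:", etf_deviation(rng.normal(size=(5, 16))))
```

An exact ETF gives a deviation of zero up to floating-point error, while unstructured vectors give a clearly positive value. A diagnostic of this kind could be applied to per-class text or image features, under the assumption that one representative vector per class is available.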