Vision-Language Models (VLMs) demonstrate impressive zero-shot generalization through large-scale image-text pretraining, yet their performance can drop once the deployment distribution diverges from the training distribution. To address this, Test-Time Adaptation (TTA) methods update models using unlabeled target data. However, existing approaches often ignore two key challenges: prototype degradation in long-tailed distributions and confusion between semantically similar classes. To tackle these issues, we propose \textbf{C}lass-Aware \textbf{P}rototype \textbf{L}earning with \textbf{N}egative \textbf{C}ontrast (\textbf{CPL-NC}), a lightweight TTA framework designed specifically for VLMs to enhance generalization under distribution shifts. CPL-NC introduces a \textit{Class-Aware Prototype Cache} module that dynamically adjusts per-class capacity based on test-time frequency and activation history, with a rejuvenation mechanism for inactive classes that retains rare-category knowledge. In addition, a \textit{Negative Contrastive Learning} mechanism identifies and constrains hard visual-textual negatives to improve class separability. The framework employs asymmetric optimization, refining only textual prototypes while anchoring on stable visual features. Experiments on 15 benchmarks show that CPL-NC consistently outperforms prior TTA methods on both ResNet-50 and ViT-B/16 backbones.
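To make the two mechanisms concrete, the following is a minimal, illustrative Python sketch (not the paper's implementation; all names and the capacity/penalty formulas are our assumptions): a prototype cache whose per-class capacity grows with test-time frequency while rare classes keep a capacity floor, plus a hinge-style penalty on hard negatives whose similarity approaches that of the true class.

```python
import math
from collections import defaultdict

class ClassAwarePrototypeCache:
    """Illustrative cache: per-class capacity scales with observed test-time
    frequency; a base capacity acts as a floor so rare classes are retained
    (a stand-in for the rejuvenation idea). Formulas are assumptions."""

    def __init__(self, base_capacity=1, max_capacity=8, decay=0.99):
        self.base_capacity = base_capacity
        self.max_capacity = max_capacity
        self.decay = decay                  # exponential decay of class frequency
        self.freq = defaultdict(float)      # class -> decayed frequency count
        self.store = defaultdict(list)      # class -> list of (entropy, feature)

    def capacity(self, c):
        # Capacity grows (sub-linearly) with frequency, floored at base_capacity.
        return min(self.max_capacity,
                   self.base_capacity + int(math.log1p(self.freq[c])))

    def update(self, c, feature, entropy):
        # Decay all frequencies, then bump the observed class.
        for k in self.freq:
            self.freq[k] *= self.decay
        self.freq[c] += 1.0
        self.store[c].append((entropy, feature))
        # Keep only the most confident (lowest-entropy) features within capacity.
        self.store[c].sort(key=lambda x: x[0])
        self.store[c] = self.store[c][: self.capacity(c)]

    def prototype(self, c):
        # Prototype = mean of cached features for class c (None if empty).
        feats = [f for _, f in self.store[c]]
        if not feats:
            return None
        d = len(feats[0])
        return [sum(f[i] for f in feats) / len(feats) for i in range(d)]

def hard_negative_penalty(sims, label, margin=0.2):
    """Hinge-style penalty on hard negatives: any wrong class whose similarity
    comes within `margin` of the true class contributes to the loss."""
    pos = sims[label]
    return sum(max(0.0, s - pos + margin)
               for c, s in sims.items() if c != label)
```

As a usage example, caching two confident "cat" features yields their mean as the prototype, and a near-miss negative such as "tiger" at similarity 0.85 against a true-class similarity of 0.9 incurs a penalty of 0.15 at margin 0.2, while a distant negative like "car" contributes nothing.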