Test-time adaptation (TTA) of vision-language models has recently attracted significant attention as a way to counter the performance degradation caused by distribution shifts in downstream tasks. However, existing cache-based TTA methods have two main limitations. First, they depend heavily on the accuracy of the pseudo-labels assigned to cached features, and noisy pseudo-labels cause the cached features to deviate from their true class distributions. As a result, cache retrieval based on similarity matching is highly sensitive to outliers and extreme samples. Second, current methods lack an effective mechanism for modeling class distributions, which limits how fully they can exploit the cached information. To address these challenges, we introduce a comprehensive and reliable caching mechanism and propose a novel zero-shot TTA method called "Cache, Residual, Gaussian" (CRG). CRG employs learnable residual parameters to better align positive and negative visual prototypes with text prototypes, thereby improving the quality of cached features. It also incorporates Gaussian Discriminant Analysis (GDA) to dynamically model intra-class feature distributions, further mitigating the impact of noisy features. Experimental results on 13 benchmarks demonstrate that CRG outperforms state-of-the-art TTA methods, showing strong robustness and adaptability.
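To make the GDA component more concrete, the following is a minimal sketch of how class scores could be derived from cached features under a Gaussian model. It is not the CRG implementation: the function name `gda_logits`, the `shrinkage` parameter, the shared covariance matrix, and the uniform class priors are all assumptions made for illustration, and the sketch assumes every class has at least one cached feature.

```python
import numpy as np


def gda_logits(feats, labels, num_classes, query, shrinkage=1e-2):
    """Score queries with Gaussian Discriminant Analysis (illustrative sketch).

    Assumptions (not taken from the paper): one covariance matrix shared by
    all classes, uniform class priors, and at least one cached feature per
    class. `feats` is (N, D), `labels` is (N,), `query` is (Q, D).
    """
    dim = feats.shape[1]
    means = np.zeros((num_classes, dim))
    centered = np.zeros_like(feats, dtype=float)

    # Per-class means, and features centered on their own class mean.
    for c in range(num_classes):
        mask = labels == c
        means[c] = feats[mask].mean(axis=0)
        centered[mask] = feats[mask] - means[c]

    # Pooled within-class covariance, regularized so it stays invertible
    # when the cache holds only a few (possibly noisy) samples.
    cov = centered.T @ centered / len(feats)
    cov += shrinkage * np.eye(dim)
    precision = np.linalg.inv(cov)

    # Linear discriminant: w_c = precision @ mu_c, b_c = -0.5 * mu_c^T w_c.
    weights = means @ precision
    bias = -0.5 * np.einsum("cd,cd->c", weights, means)
    return query @ weights.T + bias
```

Because each class is summarized by a mean and a pooled covariance rather than by individual cached samples, a single mislabeled or extreme feature shifts the scores less than it would under nearest-neighbor similarity matching, which is the intuition behind modeling class distributions explicitly.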