Vision-Language Models (VLMs) such as CLIP have become a standard backbone for open-vocabulary recognition, yet their zero-shot predictions remain vulnerable to distribution shifts encountered at deployment. Test-Time Adaptation (TTA) has recently been extended to CLIP as a lightweight solution, leading to a rapidly growing body of TTA4CLIP methods. However, empirical progress in this area has largely outpaced our understanding of what truly drives adaptation, where the gains of these methods originate, and under which shifts the methods remain reliable. In this paper, we take a step back from the pursuit of state-of-the-art accuracy and conduct a systematic, controlled study of TTA4CLIP. We first organize existing methods into three unified paradigms according to what is updated at test time. We then introduce TTABC, an open-source TTA Benchmark for CLIP, which standardizes evaluation protocols and integrates more than 20 representative methods. Our controlled empirical analysis focuses on three key areas. First, we determine the driving factors in parameter-based methods, revealing that adaptation gains are primarily driven by test-time evidence and reliable proxies rather than heavy optimization. Second, we explore evidence utilization beyond heavy parameter tuning, showing that competitive and efficient performance can be achieved through cross- or current-sample evidence and lightweight prototype updates. Finally, we demonstrate that there is no silver bullet for TTA: no single adaptation paradigm is universally optimal, and the preferred paradigm depends on the nature of the shift. We hope our benchmark and study provide a clearer understanding of the current TTA4CLIP landscape and establish a foundation for further research.