Vision-Language Models (VLMs) seamlessly discriminate among arbitrary semantic categories, yet they still suffer from poor generalization when presented with challenging examples. For this reason, Episodic Test-Time Adaptation (TTA) strategies have recently emerged as powerful techniques to adapt VLMs in the presence of a single unlabeled image. The recent literature on TTA is dominated by the paradigm of prompt tuning by Marginal Entropy Minimization, which, relying on online backpropagation, inevitably slows down inference while increasing memory. In this work, we theoretically investigate the properties of this approach and unveil that a surprisingly strong TTA method lies dormant within it. We term this approach ZERO (TTA with "zero" temperature), whose design is both incredibly effective and frustratingly simple: augment N times, predict, retain the most confident predictions, and marginalize after setting the Softmax temperature to zero. Remarkably, ZERO requires only a single batched forward pass through the vision encoder and no backward passes. We thoroughly evaluate our approach following the experimental protocol established in the literature and show that ZERO largely surpasses or compares favorably with the state-of-the-art while being almost 10x faster and 13x more memory-friendly than standard Test-Time Prompt Tuning. Thanks to its simplicity and comparatively negligible computation, ZERO can serve as a strong baseline for future work in this field. The code is available at https://github.com/FarinaMatteo/zero.
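The procedure the abstract describes can be sketched in a few lines. The following is a minimal NumPy illustration, not the authors' implementation: the confidence measure (max softmax probability) and the `retain_frac` parameter are assumptions for the sake of the example, and in practice the logits would come from a VLM's vision encoder over N augmented views.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def zero_tta(view_logits, retain_frac=0.1):
    """Sketch of ZERO-style test-time adaptation.

    view_logits: (N, C) array of class logits, one row per augmented view.
    retain_frac: fraction of most-confident views to keep (illustrative).
    Returns the predicted class index.
    """
    probs = softmax(view_logits, axis=1)
    conf = probs.max(axis=1)                      # confidence per view
    k = max(1, int(retain_frac * len(view_logits)))
    keep = np.argsort(conf)[-k:]                  # most confident views
    # Temperature -> 0: each view's softmax collapses to a one-hot argmax.
    num_classes = view_logits.shape[1]
    onehot = np.eye(num_classes)[view_logits[keep].argmax(axis=1)]
    # Marginalize over the retained views and predict.
    return int(onehot.mean(axis=0).argmax())
```

With zero temperature, marginalization reduces to a majority vote among the argmax predictions of the retained confident views, which is why no backward pass or prompt update is needed.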