Large pre-trained vision-language models (VLMs), such as CLIP, have shown unprecedented zero-shot performance across a wide range of tasks. Nevertheless, these models can become unreliable under distribution shifts, where their performance degrades significantly. In this work, we investigate how to efficiently exploit class text information to mitigate the distribution drifts that VLMs encounter during inference. In particular, we propose generating pseudo-labels for the noisy test-time samples by aligning their visual embeddings with reliable, text-based semantic anchors. Specifically, to properly preserve the regular structure of the dataset, we formulate the problem as a batch-wise label assignment, which is solved efficiently with Optimal Transport. Our method, Semantic Anchor Transport (SAT), uses these pseudo-labels as supervisory signals for test-time adaptation, yielding a principled cross-modal alignment solution. Moreover, SAT further leverages heterogeneous textual cues via a multi-template distillation approach that mirrors the multi-view contrastive strategies of unsupervised representation learning without incurring additional computational cost. Extensive experiments on multiple popular test-time adaptation benchmarks of diverse complexity empirically demonstrate the superiority of SAT, which achieves consistent performance gains over recent state-of-the-art methods while remaining computationally efficient.
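To make the batch-wise assignment concrete, the sketch below shows how Optimal-Transport pseudo-labelling of a test batch against text anchors could look, assuming CLIP-style L2-normalized embeddings and a few Sinkhorn iterations with uniform marginals. The function name, hyperparameters (`epsilon`, `n_iters`), and the uniform-marginal choice are illustrative assumptions, not SAT's exact formulation.

```python
# A minimal sketch of batch-wise pseudo-labelling via entropic Optimal
# Transport, in the spirit of the abstract. All names and hyperparameters
# here are illustrative assumptions, not the paper's exact method.
import torch

def sinkhorn_pseudo_labels(image_emb, text_anchors, epsilon=0.05, n_iters=3):
    """Assign each test-time sample in a batch to a class text anchor.

    image_emb:    (B, d) L2-normalized visual embeddings of the batch.
    text_anchors: (K, d) L2-normalized text-based semantic anchors.
    Returns hard pseudo-labels (B,) and a soft assignment matrix Q (B, K).
    """
    B, K = image_emb.size(0), text_anchors.size(0)
    # Cosine similarities act as negative transport costs.
    logits = image_emb @ text_anchors.t()              # (B, K)
    Q = torch.exp(logits / epsilon)                    # Gibbs kernel
    # Sinkhorn iterations: alternately rescale rows and columns toward
    # uniform marginals over samples (1/B) and over classes (1/K).
    for _ in range(n_iters):
        Q = Q / Q.sum(dim=1, keepdim=True) / B         # rows sum to 1/B
        Q = Q / Q.sum(dim=0, keepdim=True) / K         # cols sum to 1/K
    Q = Q * B   # rescale so each row approximates a soft label distribution
    pseudo_labels = Q.argmax(dim=1)                    # hard assignment
    return pseudo_labels, Q
```

Pushing the column marginals toward uniform is what prevents the batch's pseudo-labels from collapsing onto a handful of classes, which is one plausible reading of "preserving the regular structure of the dataset" in the abstract.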