Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection

Pretrained vision-language models such as CLIP exhibit strong zero-shot generalization but remain sensitive to distribution shifts. Test-time adaptation adapts models during inference without access to source data or target labels, offering a practical way to handle such shifts. However, existing methods typically assume that test samples come from a single, consistent domain, while in practice, test data often include samples from mixed domains with distinct characteristics. Consequently, their performance degrades under mixed-domain settings. To address this, we present Ramen, a framework for robust test-time adaptation through active sample selection. For each incoming test sample, Ramen retrieves a customized batch of relevant samples from previously seen data based on two criteria: domain consistency, which ensures that adaptation focuses on data from similar domains, and prediction balance, which mitigates adaptation bias caused by skewed predictions. To improve efficiency, Ramen employs an embedding-gradient cache that stores the embeddings and sample-level gradients of past test images. The stored embeddings are used to retrieve relevant samples, and the corresponding gradients are aggregated for model updates, eliminating the need for any additional forward or backward passes. Our theoretical analysis provides insight into why the proposed adaptation mechanism is effective under mixed-domain shifts. Experiments on multiple image corruption and domain-shift benchmarks demonstrate that Ramen achieves strong and consistent performance, offering robust and efficient adaptation in complex mixed-domain scenarios. Our code is available at https://github.com/baowenxuan/Ramen .

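The retrieval-and-aggregation loop described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how the two criteria are computed, so the sketch assumes cosine similarity between embeddings as the domain-consistency score, an equal per-class quota as the prediction-balance rule, and a plain average of cached gradients followed by a single gradient-descent step. The names `EmbeddingGradientCache`, `select`, `aggregate`, `adapt_step`, `per_class`, and `lr` are all hypothetical.

```python
import numpy as np


class EmbeddingGradientCache:
    """Toy cache of (embedding, per-sample gradient, predicted class) triples."""

    def __init__(self):
        self.embeddings = []   # L2-normalized image embeddings
        self.gradients = []    # per-sample gradients w.r.t. the adapted parameters
        self.predictions = []  # predicted class index of each cached sample

    def add(self, embedding, gradient, prediction):
        embedding = np.asarray(embedding, dtype=float)
        self.embeddings.append(embedding / np.linalg.norm(embedding))
        self.gradients.append(np.asarray(gradient, dtype=float))
        self.predictions.append(int(prediction))

    def select(self, query, per_class):
        """Pick, for every predicted class, the `per_class` cached samples
        most similar to `query`.

        Assumption: cosine similarity stands in for domain consistency and
        the equal per-class quota stands in for prediction balance.
        """
        query = np.asarray(query, dtype=float)
        query = query / np.linalg.norm(query)
        sims = np.stack(self.embeddings) @ query
        preds = np.array(self.predictions)
        chosen = []
        for c in np.unique(preds):
            idx = np.flatnonzero(preds == c)
            top = idx[np.argsort(-sims[idx])[:per_class]]
            chosen.extend(top.tolist())
        return sorted(chosen)

    def aggregate(self, indices):
        """Average the cached gradients of the selected samples (assumed rule)."""
        return np.mean([self.gradients[i] for i in indices], axis=0)


def adapt_step(params, cache, query, per_class=1, lr=0.1):
    """One update built only from cached gradients: no new forward or backward pass."""
    indices = cache.select(query, per_class)
    return params - lr * cache.aggregate(indices), indices


# Usage: two synthetic "domains" (embedding directions) and two classes.
cache = EmbeddingGradientCache()
cache.add([1.0, 0.1], [1.0, 0.0], prediction=0)  # domain A, class 0
cache.add([0.9, 0.2], [0.0, 1.0], prediction=1)  # domain A, class 1
cache.add([0.1, 1.0], [5.0, 5.0], prediction=0)  # domain B, class 0
cache.add([0.2, 0.9], [7.0, 7.0], prediction=1)  # domain B, class 1

# A query from domain A retrieves one domain-A sample per class: indices [0, 1].
new_params, indices = adapt_step(np.zeros(2), cache, query=[1.0, 0.0])
print(indices, new_params)
```

In this toy setup the domain-B gradients never enter the update for a domain-A query, which is the intuition behind selecting a customized batch per test sample. How the actual method scores domain consistency, balances predictions, and handles gradients cached under earlier parameter values is left to the paper and the released code.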