Misalignment between the outputs of a vision-language (VL) model and the task goal hinders its deployment. This issue can worsen when there are distribution shifts between the training and test data. To address this problem, prevailing fully test-time adaptation (TTA) methods bootstrap themselves through entropy minimization. However, minimizing the entropy of the predictions causes the model to overfit to its own incorrect output distributions. In this work, we propose TTA with feedback to avoid such overfitting and align the model with task goals. Specifically, we adopt CLIP as the reward model to provide feedback for VL models at test time in various tasks, including image classification, image-text retrieval, and image captioning. Given a single test sample, the model aims to maximize the CLIP reward through reinforcement learning. We adopt a reward design that uses the average CLIP score of the sampled candidates as the baseline. This design is simple and surprisingly effective when combined with various task-specific sampling strategies. The entire system is flexible, allowing the reward model to be extended with multiple CLIP models. In addition, a momentum buffer can be used to memorize and leverage the knowledge learned from multiple test samples. Extensive experiments demonstrate that our method significantly improves different VL models after TTA.
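The reward design described in the abstract (CLIP score minus the average CLIP score of the sampled candidates, optimized with a policy-gradient update) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the function names `baseline_rewards` and `policy_gradient_step` are hypothetical, the policy is a plain categorical distribution over class logits, and the CLIP scores are stand-in numbers rather than outputs of a real CLIP model. The actual method may sample candidates and update parameters differently.

```python
import numpy as np


def baseline_rewards(clip_scores):
    """Reward = CLIP score minus the mean score of the sampled candidates."""
    scores = np.asarray(clip_scores, dtype=np.float64)
    return scores - scores.mean()


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def policy_gradient_step(logits, candidates, clip_scores, lr=0.1):
    """One REINFORCE step on the logits of a categorical policy.

    logits:      current class logits (hypothetical stand-in for the VL model)
    candidates:  indices of the sampled candidate classes
    clip_scores: one score per candidate (stand-in for real CLIP scores)
    """
    logits = np.asarray(logits, dtype=np.float64)
    probs = softmax(logits)
    rewards = baseline_rewards(clip_scores)

    grad = np.zeros_like(logits)
    for k, r in zip(candidates, rewards):
        one_hot = np.zeros_like(logits)
        one_hot[k] = 1.0
        # Gradient of the loss -r * log p(k) with respect to the logits.
        grad += -r * (one_hot - probs)
    grad /= len(candidates)

    # Gradient descent on the loss, i.e. ascent on the expected reward.
    return logits - lr * grad


# Usage: three sampled candidates with made-up scores.
start_logits = np.zeros(4)
new_logits = policy_gradient_step(
    start_logits, candidates=[0, 1, 2], clip_scores=[0.30, 0.20, 0.10]
)
```

Because the baseline is the mean of the sampled scores, the rewards sum to zero: candidates scoring above average are reinforced and those below average are suppressed, without any labels for the test sample.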