Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models

Misalignment between the outputs of a vision-language (VL) model and the task goal hinders its deployment. This issue can worsen when there are distribution shifts between the training and test data. To address this problem, prevailing fully test-time adaptation (TTA) methods bootstrap themselves through entropy minimization. However, minimizing the entropy of the predictions makes the model overfit to its own incorrect output distributions. In this work, we propose TTA with feedback to avoid such overfitting and align the model with task goals. Specifically, we adopt CLIP as the reward model to provide feedback for VL models during test time in various tasks, including image classification, image-text retrieval, and image captioning. Given a single test sample, the model aims to maximize the CLIP reward through reinforcement learning. We adopt a reward design that uses the average CLIP score of the sampled candidates as the baseline. This design is simple and surprisingly effective when combined with various task-specific sampling strategies. The entire system is flexible, allowing the reward model to be extended with multiple CLIP models. In addition, a momentum buffer can be used to memorize and leverage the knowledge learned from multiple test samples. Extensive experiments demonstrate that our method significantly improves different VL models after TTA.
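The reward design described above can be illustrated with a small sketch. The abstract does not give the exact formulation, so this assumes a standard REINFORCE-style surrogate loss. The function names (`baseline_rewards`, `reinforce_loss`) and all numeric values are hypothetical and chosen only for illustration.

```python
import math

def baseline_rewards(clip_scores):
    """Subtract the mean CLIP score of the sampled candidates from each score."""
    mean = sum(clip_scores) / len(clip_scores)
    return [s - mean for s in clip_scores]

def reinforce_loss(log_probs, clip_scores):
    """Assumed REINFORCE surrogate loss: -(1/K) * sum_k r_k * log p_k."""
    rewards = baseline_rewards(clip_scores)
    return -sum(r * lp for r, lp in zip(rewards, log_probs)) / len(rewards)

# Hypothetical CLIP scores and model probabilities for three sampled candidates.
scores = [0.30, 0.20, 0.10]
log_probs = [math.log(0.5), math.log(0.3), math.log(0.2)]

rewards = baseline_rewards(scores)       # above-average candidates get positive reward
loss = reinforce_loss(log_probs, scores)  # minimizing this raises their probability
```

With the mean as the baseline, the rewards sum to zero, so candidates that CLIP scores above the average are reinforced and those below it are suppressed, without needing a separately learned baseline.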

