Visual in-context learning (VICL), as a new paradigm in computer vision, allows a model to rapidly adapt to various tasks with only a handful of prompts and examples. While effective, the existing VICL paradigm exhibits poor generalizability under distribution shifts. In this work, we propose test-time Visual In-Context Tuning (VICT), a method that can adapt VICL models on the fly with a single test sample. Specifically, we flip the roles of the task prompts and the test sample, and use a cycle-consistency loss to reconstruct the original task prompt output. Our key insight is that a model should be aware of a new test distribution if it can successfully recover the original task prompts. Extensive experiments on six representative vision tasks, ranging from high-level visual understanding to low-level image processing, with 15 common corruptions, demonstrate that our VICT can improve the generalizability of VICL to unseen new domains. In addition, we show the potential of applying VICT to unseen tasks at test time. Code: https://github.com/Jiahao000/VICT.
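The role-flipping idea above can be illustrated with a minimal toy sketch (not the paper's actual implementation): a scalar "in-context model" maps a (prompt input, prompt output) pair and a query to a prediction. At test time we first predict the test sample's output from the prompt, then swap roles, using the test sample and its predicted output as the new prompt, and reconstruct the original prompt output. The squared reconstruction error is the cycle-consistency loss, minimized here by a finite-difference gradient step; all function names and the model form are illustrative assumptions.

```python
def icl_model(w, prompt_in, prompt_out, query):
    # Hypothetical toy in-context rule: scale the query by the model
    # parameter w and by the prompt's input->output ratio. In the real
    # setting this would be a large vision model over image grids.
    return w * query * (prompt_out / prompt_in)

def cycle_loss(w, prompt_in, prompt_out, test_in):
    # Forward pass: predict the test sample's output from the prompt.
    test_out_hat = icl_model(w, prompt_in, prompt_out, test_in)
    # Role flip: use (test_in, test_out_hat) as the prompt and ask the
    # model to reconstruct the original prompt's output.
    prompt_out_hat = icl_model(w, test_in, test_out_hat, prompt_in)
    # Cycle-consistency loss: squared reconstruction error.
    return (prompt_out_hat - prompt_out) ** 2

def vict_step(w, prompt_in, prompt_out, test_in, lr=0.001, eps=1e-5):
    # One test-time tuning step on the cycle-consistency loss, using a
    # central finite-difference gradient for this scalar toy model.
    grad = (cycle_loss(w + eps, prompt_in, prompt_out, test_in)
            - cycle_loss(w - eps, prompt_in, prompt_out, test_in)) / (2 * eps)
    return w - lr * grad
```

In this toy model the cycle loss is minimized at w = 1, where the forward prediction and the role-flipped reconstruction become mutually consistent; repeated `vict_step` calls drive a miscalibrated w toward that point without ever seeing a ground-truth label for the test sample.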