We introduce Contextual Vision Transformers (ContextViT), a method for generating robust image representations for datasets in which latent factors shift across groups. Drawing on the idea of in-context learning, ContextViT adds a context token that encapsulates group-specific information, allowing the model to adjust the image representation to the group-specific context. Specifically, for a given input image, ContextViT maps the images that share its group membership into this context token, which is appended to the input image tokens. We also introduce a context inference network that predicts such tokens on the fly from a batch of samples from the group, which enables ContextViT to adapt to new test distributions at inference time. We demonstrate the efficacy of ContextViT across a wide range of applications. In supervised fine-tuning, we show that augmenting pre-trained ViTs with our proposed context conditioning mechanism consistently improves out-of-distribution generalization on iWildCam and FMoW. We also investigate self-supervised representation learning with ContextViT. Our experiments on the Camelyon17 pathology imaging benchmark and the JUMP-CP microscopy imaging benchmark demonstrate that ContextViT excels at learning stable image featurizations under distribution shift, consistently outperforming its ViT counterpart.
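The context-token mechanism described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`patchify`, `infer_context_token`, `append_context_token`), the projection matrices `w_embed` and `w_ctx`, and the choice of mean pooling followed by a single linear map as the context inference network are all assumptions made for illustration. The abstract only states that a context token is predicted from a batch of same-group images and appended to the image tokens.

```python
import numpy as np


def patchify(images, patch):
    """Split images of shape (B, H, W, C) into flattened patches (B, N, patch*patch*C)."""
    b, h, w, c = images.shape
    gh, gw = h // patch, w // patch
    x = images.reshape(b, gh, patch, gw, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # (B, gh, gw, patch, patch, C)
    return x.reshape(b, gh * gw, patch * patch * c)


def infer_context_token(group_tokens, w_ctx):
    """Hypothetical context inference network (an assumption, not from the source).

    Mean-pools the patch embeddings of every image in the group batch,
    then applies one linear map to obtain a single context token of shape (D,).
    """
    pooled = group_tokens.mean(axis=(0, 1))  # average over images and patches
    return pooled @ w_ctx


def append_context_token(tokens, context_token):
    """Append the same context token to the token sequence of each image."""
    b, _, d = tokens.shape
    ctx = np.broadcast_to(context_token, (b, 1, d))
    return np.concatenate([tokens, ctx], axis=1)  # (B, N + 1, D)


rng = np.random.default_rng(0)
group_images = rng.normal(size=(4, 8, 8, 3))   # 4 images from one group
w_embed = rng.normal(size=(4 * 4 * 3, 16))     # illustrative patch embedding
w_ctx = rng.normal(size=(16, 16))              # illustrative context projection

tokens = patchify(group_images, patch=4) @ w_embed       # (4, 4, 16)
context = infer_context_token(tokens, w_ctx)             # (16,)
vit_input = append_context_token(tokens, context)        # (4, 5, 16)
```

Because the context token is computed from whatever batch of group samples is available, the same functions can be called on images from a group unseen during training, which is the sense in which the token is predicted on the fly. The resulting `vit_input` would then be fed to the transformer layers in place of the plain patch tokens.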