Visual in-context learning has been proposed as a pathway towards dynamic models that generate predictions from a provided context and can thereby adapt to new vision tasks at test time. Yet the evaluation of these models' adaptation capabilities has been limited to narrow setups that mainly mirror tasks or image domains seen during pre-training, for which real adaptation is not required. We address this gap by constructing a broad Visual In-Context BEnchmark (VIBE) that focuses on diverse imaging domains and a wide range of tasks. This gives a much clearer picture of the adaptive capabilities of visual in-context models when they are faced with new image and task distributions. We stress-test six models on $14$ datasets and $12$ tasks ($106$ dataset-task combinations in total) and compare them under a unified, reproducible evaluation protocol in a one-shot setting. Our evaluation uncovers key insights into the state of visual in-context learning, including limitations, systematic failure modes, and promising directions. To foster broader evaluation, we will openly release our VIBE toolkit.