Large vision-language models (VLMs) have become state-of-the-art for many computer vision tasks, with in-context learning (ICL) emerging as a popular strategy for adapting them to new tasks. But can VLMs learn novel concepts purely from visual demonstrations, or are they limited to adapting to the output format of ICL examples? We propose Spatial Visual Ambiguity Tasks (SVAT), a new benchmark that challenges state-of-the-art VLMs to learn new visuospatial tasks in-context. We find that VLMs fail to do this zero-shot, and sometimes continue to fail after finetuning. However, incorporating simpler data into training via curriculum learning improves ICL performance.