Large-scale models trained on broad data have recently become the mainstream architecture in computer vision owing to their strong generalization performance. This paper focuses on an emergent ability of large vision models known as in-context learning, which allows inference on unseen tasks by conditioning on in-context examples (a.k.a. prompts) without updating the model parameters. The concept is well known in natural language processing but has only very recently been studied for large vision models. We provide the first comprehensive investigation of the impact of in-context examples in computer vision and find that performance is highly sensitive to the choice of in-context examples. To overcome this problem, we propose a prompt retrieval framework that automates the selection of in-context examples. Specifically, we present (1) an unsupervised prompt retrieval method based on nearest example search using an off-the-shelf model, and (2) a supervised prompt retrieval method that trains a neural network to choose examples that directly maximize in-context learning performance. The results demonstrate that our methods bring non-trivial improvements to visual in-context learning compared with the commonly used random selection.
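As a rough illustration of the unsupervised variant, nearest example search can be sketched as ranking a pool of candidate in-context examples by feature similarity to the query image. The sketch below assumes the features have already been extracted by some off-the-shelf model and uses cosine similarity as the ranking score. The abstract does not specify the similarity measure or the feature extractor, so both are assumptions, and the function names `cosine_similarity` and `retrieve_prompts` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_prompts(
    query_feat: Sequence[float],
    pool_feats: Sequence[Sequence[float]],
    k: int = 1,
) -> List[Tuple[int, float]]:
    """Return (index, similarity) for the k pool examples closest to the query.

    Assumption: `query_feat` and `pool_feats` were produced by the same
    pre-trained feature extractor; this function only performs the search.
    """
    scored = [
        (idx, cosine_similarity(query_feat, feat))
        for idx, feat in enumerate(pool_feats)
    ]
    # Highest similarity first; Python's sort is stable, so ties keep pool order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 2-D features standing in for real image embeddings.
    pool = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    query = [0.9, 0.1]
    for idx, score in retrieve_prompts(query, pool, k=2):
        print(f"candidate {idx}: similarity {score:.3f}")
```

The retrieved indices would then pick the in-context examples placed alongside the query, in place of a randomly drawn example. The supervised variant described in the abstract would instead learn the scoring function from in-context learning performance, which this sketch does not attempt.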