Large-scale models trained on broad data have recently become the mainstream architecture in computer vision due to their strong generalization performance. This paper focuses on an emergent ability of large vision models known as in-context learning, which allows inference on unseen tasks by conditioning on in-context examples (i.e., the prompt) without updating the model parameters. The concept is well known in natural language processing but has only very recently been studied for large vision models. We provide the first comprehensive investigation of the impact of in-context examples in computer vision, and find that performance is highly sensitive to the choice of in-context examples. To overcome this problem, we propose a prompt retrieval framework that automates the selection of in-context examples. Specifically, we present (1) an unsupervised prompt retrieval method based on nearest example search using an off-the-shelf model, and (2) a supervised prompt retrieval method, which trains a neural network to choose examples that directly maximize in-context learning performance. The results demonstrate that our methods bring non-trivial improvements to visual in-context learning compared with the commonly used random selection.
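As a rough illustration of the unsupervised variant, nearest example search can be sketched as ranking a pool of candidate examples by feature similarity to the query. The sketch below is not the authors' implementation: it assumes features have already been extracted by some off-the-shelf model, it uses cosine similarity as the distance measure (the abstract does not specify one), and the names `cosine_similarity` and `retrieve_prompts` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_prompts(
    query_feat: Sequence[float],
    pool_feats: Sequence[Sequence[float]],
    k: int = 1,
) -> List[Tuple[int, float]]:
    """Return (index, similarity) for the k pool examples closest to the query.

    Assumption: `query_feat` and `pool_feats` are precomputed features from an
    off-the-shelf encoder; the encoder itself is outside this sketch.
    """
    scored = [(i, cosine_similarity(query_feat, f)) for i, f in enumerate(pool_feats)]
    # Highest similarity first; ties keep their original pool order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D features standing in for real image embeddings.
pool = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query = [0.9, 0.1]
top = retrieve_prompts(query, pool, k=2)
```

The selected indices would then pick the in-context examples placed in the prompt, in place of random selection. The supervised variant described in the abstract would replace the fixed feature extractor with a learned one; its training objective is not detailed here, so it is not sketched.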