Exploring Effective Factors for Improving Visual In-Context Learning

In-context learning (ICL) refers to understanding a new task from a few demonstrations (a.k.a. prompts) and making predictions on new inputs without tuning the model. While ICL has been widely studied in NLP, it is still a relatively new area of research in computer vision. To reveal the factors that influence the performance of visual in-context learning, this paper shows that prompt selection and prompt fusion are two major factors with a direct impact on its inference performance. Prompt selection is the process of identifying the most appropriate prompt or example to help the model understand a new task. This matters because relevant prompts help the model learn more effectively and efficiently. Prompt fusion involves combining knowledge from different positions within the large-scale visual model, so that the model can leverage the diverse knowledge stored in its different parts to improve performance on new tasks. Based on these findings, we propose a simple framework, prompt-SelF, for visual in-context learning. Specifically, we first use a pixel-level retrieval method to select a suitable prompt, then apply different prompt fusion methods to activate all the knowledge stored in the large-scale model, and finally ensemble the predictions obtained from the different prompt fusion methods to produce the final prediction. We conduct extensive experiments on single-object segmentation and detection tasks to demonstrate the effectiveness of prompt-SelF. Remarkably, prompt-SelF outperforms the meta-learning-based OSLSM in 1-shot segmentation for the first time, which indicates the great potential of visual in-context learning. The source code and models will be available at \url{https://github.com/syp2ysy/prompt-SelF}.
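The select-then-ensemble pipeline described above can be illustrated with a minimal sketch. It is not the released prompt-SelF implementation: the feature-map layout, the use of mean per-pixel cosine similarity as the retrieval score, and the averaging-plus-threshold ensemble are all assumptions made for illustration, and the names `pixel_level_score`, `select_prompt`, and `ensemble` are hypothetical. The fusion step itself (running the large-scale model on differently fused prompts) is outside the sketch; the ensemble only consumes its per-pixel outputs.

```python
import math
from typing import List, Sequence

# Assumed layout: one feature vector per pixel, flattened over the image.
FeatureMap = List[List[float]]
# Assumed layout: one foreground probability per pixel, flattened.
Mask = List[float]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def pixel_level_score(query: FeatureMap, candidate: FeatureMap) -> float:
    """Mean cosine similarity between features at matching pixel positions
    (an assumed stand-in for the paper's pixel-level retrieval score)."""
    if not query or len(query) != len(candidate):
        raise ValueError("feature maps must be non-empty and equally sized")
    return sum(cosine(q, c) for q, c in zip(query, candidate)) / len(query)


def select_prompt(query: FeatureMap, candidates: List[FeatureMap]) -> int:
    """Return the index of the candidate prompt most similar to the query."""
    scores = [pixel_level_score(query, c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


def ensemble(predictions: List[Mask], threshold: float = 0.5) -> List[int]:
    """Average the masks predicted under different prompt fusions, then
    binarize (an assumed ensembling rule, not confirmed by the abstract)."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    n = len(predictions)
    mean = [sum(p[i] for p in predictions) / n for i in range(len(predictions[0]))]
    return [1 if m >= threshold else 0 for m in mean]
```

In use, `select_prompt` would pick one demonstration from a candidate pool for each query image, and `ensemble` would merge the masks that the model produces for that query under each prompt fusion variant.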
