Visual In-Context Learning (VICL) has emerged as a powerful paradigm, enabling models to perform novel visual tasks by learning from in-context examples. The dominant "retrieve-then-prompt" approach typically relies on selecting the single best visual prompt, a practice that often discards valuable contextual information from other suitable candidates. While recent work has explored fusing the top-K prompts into a single, enhanced representation, this still collapses multiple rich signals into one, limiting the model's reasoning capability. We argue that a more multi-faceted, collaborative fusion is required to unlock the full potential of these diverse contexts. To address this limitation, we introduce a novel framework that moves beyond single-prompt fusion toward multi-combination collaborative fusion. Instead of collapsing multiple prompts into one, our method generates three contextual representation branches, each formed by integrating information from a different combination of top-quality prompts. These complementary guidance signals are then fed into the proposed MULTI-VQGAN architecture, which is designed to jointly interpret and exploit collaborative information from multiple sources. Extensive experiments on diverse tasks, including foreground segmentation, single-object detection, and image colorization, demonstrate that our framework achieves strong cross-task generalization, effective contextual fusion, and more robust and accurate predictions than existing methods.
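To make the branch-construction idea concrete, the following is a minimal PyTorch sketch of one possible realization. It is not the authors' implementation: the class name `MultiCombinationFusion`, the choice of top-3 prompts with pairwise combinations, the mean-pooling fusion, and the per-branch linear projections are all illustrative assumptions; the abstract does not specify how combinations are fused or how MULTI-VQGAN consumes the resulting branches.

```python
# Minimal sketch (assumptions noted above, not the paper's method): build one
# fused guidance branch per combination of top-K retrieved prompt features,
# yielding multiple complementary signals for a downstream multi-branch decoder.
from itertools import combinations

import torch
import torch.nn as nn


class MultiCombinationFusion(nn.Module):
    """One fused representation per combination of top-K prompt features."""

    def __init__(self, dim: int, k: int = 3, combo_size: int = 2):
        super().__init__()
        self.k = k
        self.combo_size = combo_size
        # One lightweight projection per branch; C(k, combo_size) branches total.
        n_branches = len(list(combinations(range(k), combo_size)))
        self.branch_proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_branches)])

    def forward(self, prompt_feats: torch.Tensor, query_feat: torch.Tensor) -> torch.Tensor:
        # prompt_feats: (k, dim) features of the top-K retrieved prompts
        # query_feat:   (dim,)  feature of the query image
        branches = []
        for proj, idx in zip(self.branch_proj, combinations(range(self.k), self.combo_size)):
            combo = prompt_feats[list(idx)].mean(dim=0)   # fuse this prompt combination
            branches.append(proj(combo + query_feat))     # condition the branch on the query
        return torch.stack(branches)                      # (n_branches, dim)


if __name__ == "__main__":
    fuser = MultiCombinationFusion(dim=256, k=3, combo_size=2)
    prompts = torch.randn(3, 256)   # top-3 retrieved prompt features
    query = torch.randn(256)
    guidance = fuser(prompts, query)
    print(guidance.shape)           # torch.Size([3, 256]) -> three guidance branches
```

With `k=3` and pairwise combinations, the sketch happens to yield three branches, matching the count in the abstract, but the actual grouping strategy used by the framework may differ.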