Vision In-Context Learning (VICL) enables inpainting models to adapt quickly to new visual tasks from only a few prompts. However, existing methods suffer from two key issues: (1) selecting only the single most similar prompt discards complementary cues from other high-quality prompts; and (2) the structured information implied by different prompt arrangements goes unexploited. We propose an end-to-end VICL framework that overcomes both limitations. First, an adaptive fusion module aggregates critical patterns and annotations from multiple prompts to form a more precise contextual prompt. Second, we introduce arrangement-specific lightweight MLPs that decouple layout priors from the core model while adding minimal overhead. In addition, a bidirectional fine-tuning mechanism swaps the roles of query and prompt, encouraging the model to reconstruct the original prompt from the fused context and thereby strengthening the collaboration between the fusion module and the inpainting model. Experiments on foreground segmentation, single-object detection, and image colorization demonstrate superior results and strong cross-task generalization of our method.
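To make the fusion idea concrete, here is a minimal PyTorch sketch of a similarity-weighted multi-prompt aggregator. All names (PromptFusion, embed_dim, the tensor layout) are illustrative assumptions, not the paper's actual implementation; the point is only that all K retrieved prompts contribute to the fused context rather than a single nearest neighbor.

```python
# A hedged sketch of multi-prompt fusion, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptFusion(nn.Module):
    """Aggregates features from several candidate prompts into one
    fused contextual prompt, weighted by similarity to the query."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.query_proj = nn.Linear(embed_dim, embed_dim)
        self.prompt_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, query_feat: torch.Tensor,
                prompt_feats: torch.Tensor) -> torch.Tensor:
        # query_feat:   (B, D)    pooled feature of the query image
        # prompt_feats: (B, K, D) features of K retrieved prompt pairs
        q = self.query_proj(query_feat).unsqueeze(1)        # (B, 1, D)
        p = self.prompt_proj(prompt_feats)                  # (B, K, D)
        scores = (q * p).sum(-1) / p.size(-1) ** 0.5        # (B, K)
        weights = F.softmax(scores, dim=-1).unsqueeze(-1)   # (B, K, 1)
        # Weighted sum keeps complementary cues from all K prompts
        # instead of discarding all but the single most similar one.
        return (weights * prompt_feats).sum(dim=1)          # (B, D)
```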
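The bidirectional mechanism can likewise be sketched as a training step in which the query pair takes a prompt's place and the model must reconstruct that prompt's annotation from the fused context. The `inpaint_model(image, context=...)` signature and the tensor layout below are assumptions for illustration; the actual role-swapping and loss weighting may differ in the paper.

```python
# A hedged sketch of bidirectional fine-tuning under assumed interfaces.
import torch

def bidirectional_step(inpaint_model, fusion, optimizer, criterion,
                       query_img, query_feat, query_target,
                       prompt_imgs, prompt_feats, prompt_targets):
    optimizer.zero_grad()

    # Forward direction: fused prompts condition the query prediction.
    fused = fusion(query_feat, prompt_feats)                 # (B, D)
    pred_query = inpaint_model(query_img, context=fused)
    loss_fwd = criterion(pred_query, query_target)

    # Reverse direction: swap roles. The query pair joins the context,
    # and the model reconstructs the first prompt's annotation from it.
    swapped = torch.cat([query_feat.unsqueeze(1),
                         prompt_feats[:, 1:]], dim=1)        # (B, K, D)
    fused_rev = fusion(prompt_feats[:, 0], swapped)
    pred_prompt = inpaint_model(prompt_imgs[:, 0], context=fused_rev)
    loss_rev = criterion(pred_prompt, prompt_targets[:, 0])

    # Both directions train the fusion module and the inpainting model
    # jointly, which is the collaboration the abstract describes.
    loss = loss_fwd + loss_rev
    loss.backward()
    optimizer.step()
    return loss.item()
```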