Visual In-Context Learning (VICL) aims to complete vision tasks by imitating pixel demonstrations. Recent work pioneered prompt fusion, which combines the advantages of different demonstrations and offers a promising way to extend VICL. Unfortunately, the patch-wise fusion framework and model-agnostic supervision hinder the exploitation of informative cues, thereby limiting performance gains. To overcome this deficiency, we introduce PromptHub, a framework that holistically strengthens multi-prompting through locality-aware fusion, concentration, and alignment. PromptHub exploits spatial priors to capture richer contextual information, employs complementary concentration, alignment, and prediction objectives to mutually guide training, and incorporates data augmentation to further reinforce supervision. Extensive experiments on three fundamental vision tasks demonstrate the superiority of PromptHub. Moreover, we validate its universality, transferability, and robustness across out-of-distribution settings and various retrieval scenarios. This work establishes a reliable locality-aware paradigm for prompt fusion, moving beyond prior patch-wise approaches. Code is available at https://github.com/luotc-why/ICLR26-PromptHub.