Replicating In-Context Learning (ICL) in computer vision, i.e., visual in-context learning (V-ICL), remains challenging due to task heterogeneity. We propose \textbf{VIRAL}, a framework that elicits visual reasoning from a pre-trained image editing model by formulating ICL as conditional generation via visual analogy ($x_s : x_t :: x_q : y_q$). We adapt a frozen Diffusion Transformer (DiT) with role-aware multi-image conditioning and introduce a Mixture-of-Experts LoRA to mitigate gradient interference across diverse tasks. To address the limitations of existing visual in-context datasets, we further curate a large-scale dataset spanning perception, restoration, and editing. Experiments demonstrate that VIRAL outperforms existing methods, validating that a unified V-ICL paradigm can handle most visual tasks, including open-domain editing. Our code is available at https://anonymous.4open.science/r/VIRAL-744A.
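As a minimal sketch of the analogy-as-conditional-generation formulation above (the conditional density $p_{\theta,\phi}$ and the split into frozen DiT weights $\theta$ and adapter weights $\phi$ are our notational assumptions, not taken from the abstract), the query output is generated conditioned on the in-context source--target pair and the query image:
\[
y_q \sim p_{\theta,\phi}\!\left(y \mid x_s,\, x_t,\, x_q\right), \qquad x_s : x_t \,::\, x_q : y_q,
\]
where $(x_s, x_t)$ is the in-context example pair, $x_q$ is the query image, and $y_q$ is the output implied by the analogy.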