ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation

While language-guided image manipulation has made remarkable progress, instructing the manipulation process so that it faithfully reflects human intentions remains a challenge. Describing a manipulation task accurately and comprehensively in natural language is laborious and sometimes impossible, primarily because of the uncertainty and ambiguity inherent in linguistic expressions. Is it feasible to accomplish image manipulation without resorting to external cross-modal language information? If so, the inherent modality gap would be eliminated. In this paper, we propose a novel manipulation methodology, dubbed ImageBrush, that learns visual instructions for more accurate image editing. Our key idea is to employ a pair of transformation images as visual instructions, which not only captures human intention precisely but is also accessible in real-world scenarios. Capturing visual instructions is particularly challenging because it involves extracting the underlying intention solely from visual demonstrations and then applying the same operation to a new image. To address this challenge, we formulate visual instruction learning as a diffusion-based inpainting problem, where the contextual information is fully exploited through an iterative generation process. A visual prompting encoder is carefully devised to enhance the model's capacity to uncover the human intent behind the visual instructions. Extensive experiments show that our method generates engaging manipulation results that conform to the transformations entailed in the demonstrations. Moreover, our model exhibits robust generalization capabilities on various downstream tasks such as pose transfer, image translation and video inpainting.
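
To make the inpainting formulation concrete, the following is a minimal sketch of how a visual instruction (a before/after example pair) and a query image might be packed into a single canvas whose missing region a diffusion inpainting model would then fill. The abstract does not specify the layout, so the 2x2 grid arrangement, the function name `build_inpainting_canvas`, and the nested-list image representation are all assumptions made for illustration; the diffusion model and the visual prompting encoder are not modeled here.

```python
from typing import List, Tuple

# Illustrative image type: an H x W grayscale image stored as nested lists.
Image = List[List[float]]


def build_inpainting_canvas(
    example_before: Image,
    example_after: Image,
    query: Image,
    fill: float = 0.0,
) -> Tuple[Image, List[List[int]]]:
    """Pack an example pair and a query image into one canvas plus a mask.

    Assumed layout (hypothetical, not taken from the abstract):
        top-left:     example before the edit
        top-right:    example after the edit
        bottom-left:  query image
        bottom-right: unknown region to be inpainted
    The mask is 1 where the model should generate content, 0 elsewhere.
    """
    h, w = len(query), len(query[0])
    for img in (example_before, example_after, query):
        if len(img) != h or any(len(row) != w for row in img):
            raise ValueError("all three images must share the same H x W shape")

    # Top half: the visual instruction (before | after).
    top = [list(rb) + list(ra) for rb, ra in zip(example_before, example_after)]
    # Bottom half: the query next to a placeholder for the missing result.
    bottom = [list(rq) + [fill] * w for rq in query]
    canvas = top + bottom

    # Only the bottom-right quadrant is marked for generation.
    mask = [[0] * (2 * w) for _ in range(h)]
    mask += [[0] * w + [1] * w for _ in range(h)]
    return canvas, mask
```

Under this assumed layout, an inpainting model conditioned on the three known quadrants would have to infer the transformation from the top row and apply it to the query in order to complete the masked quadrant.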