We propose Context Diffusion, a diffusion-based framework that enables image generation models to learn from visual examples presented in context. Recent work tackles such in-context learning for image generation, where a query image is provided alongside context examples and text prompts. However, the quality and fidelity of the generated images deteriorate when the text prompt is absent, which shows that these models do not truly learn from the visual context. To address this, we propose a novel framework that separates the encoding of the visual context from the preservation of the query image's structure. As a result, the model can learn from the visual context and text prompts together, or from either one alone. Furthermore, we enable our model to handle few-shot settings, so that it can address diverse in-context learning scenarios. Our experiments and user study demonstrate that Context Diffusion excels in both in-domain and out-of-domain tasks, resulting in an overall improvement in image quality and fidelity compared to counterpart models.