This paper introduces LeftRefill, an approach that efficiently harnesses large Text-to-Image (T2I) diffusion models for reference-guided image synthesis. As the name implies, LeftRefill horizontally stitches the reference and target views into a single input: the reference image occupies the left side, and the target canvas sits on the right. LeftRefill then paints the right-side target canvas based on the left-side reference and task-specific instructions. This formulation resembles contextual inpainting, akin to how a human painter works. It efficiently learns both structural and textural correspondence between reference and target without additional image encoders or adapters. We inject task and view information through the cross-attention modules of the T2I model, and support multi-view references through re-arranged self-attention modules. Together, these enable LeftRefill to generate consistent results as a generalized model, without test-time fine-tuning or model modifications. LeftRefill can therefore be seen as a simple yet unified framework for reference-guided synthesis. As an exemplar, we apply LeftRefill to two different tasks, reference-guided inpainting and novel view synthesis, based on the pre-trained StableDiffusion. Code and models are released at https://github.com/ewrfcas/LeftRefill.
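The left-reference, right-target input described above can be illustrated with a minimal NumPy sketch. It is not taken from the released code: the function name `stitch_reference_and_target`, the mask convention (1 marks pixels to be painted), and the zero-filling of hidden pixels are all assumptions made for illustration. It only shows how a stitched image and its inpainting mask might be assembled before being passed to an inpainting-style diffusion model.

```python
import numpy as np


def stitch_reference_and_target(reference, target, target_mask=None):
    """Build a stitched input: reference on the left, target canvas on the right.

    Hypothetical helper, not part of the LeftRefill repository.

    reference, target: float arrays of shape (H, W, C) with identical shapes.
    target_mask: optional (H, W) array, 1 where the target must be painted.
        None means the whole right-side canvas is painted, which roughly
        corresponds to the novel view synthesis setting.
    Returns (stitched_image, stitched_mask) with shapes (H, 2W, C) and (H, 2W).
    """
    if reference.shape != target.shape:
        raise ValueError("reference and target must share the same shape")
    h, w, _ = target.shape
    if target_mask is None:
        target_mask = np.ones((h, w), dtype=np.float32)
    target_mask = target_mask.astype(np.float32)

    # Hide the pixels that the model is expected to paint (assumed: zero-fill).
    masked_target = target * (1.0 - target_mask)[..., None]

    # Horizontal stitching: concatenate along the width axis.
    stitched_image = np.concatenate([reference, masked_target], axis=1)

    # The reference side is never repainted, so its mask is all zeros.
    reference_mask = np.zeros((h, w), dtype=np.float32)
    stitched_mask = np.concatenate([reference_mask, target_mask], axis=1)
    return stitched_image, stitched_mask


# Usage: a 4x6 reference and a target with a 2x3 hole to be inpainted.
reference = np.ones((4, 6, 3), dtype=np.float32)
target = np.full((4, 6, 3), 0.5, dtype=np.float32)
hole = np.zeros((4, 6), dtype=np.float32)
hole[1:3, 2:5] = 1.0
image, mask = stitch_reference_and_target(reference, target, hole)
```

Under these assumptions, reference-guided inpainting passes a partial mask for the target, while novel view synthesis leaves `target_mask` unset so that the entire right half is generated.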