Recent advances in Text-to-Image (T2I) generative models have yielded impressive results in generating high-fidelity images that are consistent with text prompts. However, there is growing interest in exploring the potential of these models for more diverse reference-based image manipulation tasks that require spatial understanding and visual context. Previous approaches have achieved this by incorporating additional control modules or by fine-tuning the generative models for each task until convergence. In this paper, we propose a different perspective. We conjecture that current large-scale T2I generative models already possess the capability to perform these tasks, but that this capability is not fully activated within the standard generation process. To unlock these capabilities, we introduce a unified Prompt-Guided In-Context inpainting (PGIC) framework, which leverages large-scale T2I models to reformulate and solve reference-guided image manipulations. In the PGIC framework, the reference and the masked target are stitched together as a new input for the generative model, so that filling the masked regions produces the final result. Furthermore, we demonstrate that the self-attention modules in T2I models are well suited to establishing spatial correlations and efficiently addressing challenging reference-guided manipulations. These large T2I models can be effectively driven by task-specific prompts with minimal training cost, or even with frozen backbones. We comprehensively evaluate the effectiveness of the proposed PGIC framework across various tasks, including reference-guided image inpainting, faithful inpainting, outpainting, local super-resolution, and novel view synthesis. Our results show that PGIC achieves significantly better performance while requiring less computation than other fine-tuning-based approaches.
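The stitching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `stitch_reference_and_target`, the left/right ordering, the equal-size requirement, and the mask convention (1 marks pixels to fill) are all assumptions made for illustration. The sketch only builds the stitched input and its mask; the inpainting model itself is not shown.

```python
import numpy as np

def stitch_reference_and_target(reference, target, target_mask):
    """Build a side-by-side inpainting input (hypothetical helper).

    reference, target: float arrays of shape (H, W, C).
    target_mask: float array of shape (H, W), 1 = pixel to be filled.
    Returns the stitched canvas (H, 2W, C) and its mask (H, 2W).
    """
    if reference.shape != target.shape:
        raise ValueError("reference and target must have the same shape")
    if target_mask.shape != target.shape[:2]:
        raise ValueError("target_mask must match the target's height and width")

    # Blank out the region of the target that the model should fill.
    masked_target = target * (1.0 - target_mask[..., None])

    # Assumed layout: reference on the left, masked target on the right.
    canvas = np.concatenate([reference, masked_target], axis=1)

    # The reference half is never inpainted, so its mask is all zeros.
    canvas_mask = np.concatenate([np.zeros_like(target_mask), target_mask], axis=1)
    return canvas, canvas_mask
```

Under these assumptions, an inpainting model given `canvas` and `canvas_mask` would only synthesize the masked part of the right half, while its self-attention layers can attend to the visible reference on the left.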