Recent diffusion-based models achieve photorealistic image inpainting but require many sampling steps, which limits their practical use. Few-step text-to-image models generate faster, but naively applying them to inpainting yields poor harmonization and visible artifacts between the background and the inpainted region. We trace this failure to random Gaussian noise initialization, which, at a low number of function evaluations (NFEs), causes semantic misalignment and reduced fidelity. To overcome this, we propose InverFill, a one-step inversion method tailored for inpainting that injects semantic information from the masked input image into the initial noise, enabling high-fidelity few-step inpainting. Instead of training a dedicated inpainting model, InverFill uses few-step text-to-image models in a blended sampling pipeline with semantically aligned noise as input, significantly improving over vanilla blended sampling and even matching specialized inpainting models at low NFEs. Moreover, InverFill requires no real-image supervision and adds only minimal inference overhead. Extensive experiments show that InverFill consistently improves baseline few-step models in image quality and text coherence, without costly retraining or heavy iterative optimization.
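To make the blended sampling pipeline concrete, the following is a minimal NumPy sketch of few-step blended sampling with a pluggable initial noise. It is an illustration under stated assumptions, not the InverFill implementation: the names `add_noise`, `blended_sampling`, and `denoise_fn` are hypothetical, the linear noising schedule is assumed for simplicity, and the one-step inversion network that would produce the semantically aligned noise is not modeled here (the caller simply passes `init_noise`).

```python
import numpy as np


def add_noise(x0, noise, sigma):
    """Interpolate a clean image toward noise.

    Assumption: a linear (rectified-flow style) schedule with sigma in [0, 1];
    the schedule actually used by a given few-step model may differ.
    """
    return (1.0 - sigma) * x0 + sigma * noise


def blended_sampling(denoise_fn, image, mask, init_noise, sigmas, rng):
    """Few-step blended sampling (illustrative sketch).

    denoise_fn(x, sigma) -> predicted clean image; stands in for a few-step
                            text-to-image model (hypothetical interface).
    image      : known input image, shape (H, W, C).
    mask       : 1 where content must be inpainted, 0 on known background,
                 shape (H, W, 1) so it broadcasts over channels.
    init_noise : starting sample. Vanilla blended sampling passes random
                 Gaussian noise; an inversion-based method would pass noise
                 derived from the masked input instead.
    sigmas     : decreasing noise levels, one per function evaluation.
    """
    x = init_noise
    for i, sigma in enumerate(sigmas):
        # One function evaluation: predict the clean image from the sample.
        x0_pred = denoise_fn(x, sigma)
        next_sigma = sigmas[i + 1] if i + 1 < len(sigmas) else 0.0

        # Bring both the prediction and the known image to the next level,
        # sharing the same noise draw.
        noise = rng.standard_normal(image.shape)
        x_generated = add_noise(x0_pred, noise, next_sigma)
        x_known = add_noise(image, noise, next_sigma)

        # Blend: generated content inside the mask, known content outside.
        x = mask * x_generated + (1.0 - mask) * x_known
    return x
```

With this structure, the only difference between the vanilla baseline and an inversion-based variant is the `init_noise` argument, which mirrors the abstract's claim that the initialization, rather than the sampler, is the source of misalignment at low NFEs. Because the final step uses a noise level of zero, the known background is copied through unchanged.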