Text-to-image generation has advanced rapidly, driven largely by recent progress in diffusion models. Because text alone cannot specify detailed conditions such as object appearance, reference images are commonly used to control the objects in generated images. However, existing methods still suffer from limited accuracy when the relationship between foreground and background is complicated. To address this issue, we develop a framework termed Mask-ControlNet, which introduces an additional mask prompt. Specifically, we first employ large vision models to obtain masks that segment the objects of interest in the reference image. The resulting object images are then used as additional prompts to help the diffusion model better understand the relationship between foreground and background regions during image generation. Experiments show that the mask prompts improve the controllability of the diffusion model, yielding higher fidelity to the reference image together with better image quality. Comparisons with previous text-to-image generation methods on benchmark datasets demonstrate that our method performs better both quantitatively and qualitatively.
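As a rough illustration of the masking step described above, the sketch below turns a reference image and a binary mask into an object image in which only the object of interest remains. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how the object image is formed, so zeroing out the background is an assumption, the function name `extract_object_image` and the `fill` parameter are hypothetical, and the hand-written mask stands in for the output of a large vision model.

```python
import numpy as np


def extract_object_image(reference: np.ndarray, mask: np.ndarray, fill: int = 0) -> np.ndarray:
    """Keep pixels where `mask` is true and set all other pixels to `fill`.

    Hypothetical helper: blanking the background is an assumption,
    not a detail taken from the paper.
    """
    if reference.ndim != 3:
        raise ValueError("reference must have shape (H, W, C)")
    if mask.shape != reference.shape[:2]:
        raise ValueError("mask must have shape (H, W) matching the reference")
    # Add a trailing axis so the (H, W) mask broadcasts over the channels.
    keep = mask.astype(bool)[..., None]
    return np.where(keep, reference, np.asarray(fill, dtype=reference.dtype))


# Toy 4x4 RGB reference image with a constant value of 200.
reference = np.full((4, 4, 3), 200, dtype=np.uint8)

# Hand-written mask covering the central 2x2 block; in practice this
# would come from a segmentation model.
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

object_image = extract_object_image(reference, mask)
```

Under these assumptions, `object_image` keeps the shape and dtype of the reference image, so it could be passed to a conditioning branch alongside the text prompt in the same way as any other image-valued control signal.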