Current large-scale generative models are impressively effective at generating high-quality images from text prompts. However, they cannot precisely control the size and position of objects in the generated image. In this study, we analyze the generative mechanism of the Stable Diffusion model and propose a new interactive generation paradigm that allows users to specify the position of generated objects without additional training. Moreover, we propose an object detection-based evaluation metric to assess control capability in the location-aware generation task. Our experimental results show that our method outperforms state-of-the-art methods in both control capability and image quality.
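The abstract does not define the object detection-based metric, so the following is only a minimal sketch of one plausible form it could take, not the authors' method. It assumes that a detector has already returned a bounding box for the generated object, and it scores placement by the intersection-over-union (IoU) between that box and the box the user requested. The names `iou` and `placement_hit`, the `(x_min, y_min, x_max, y_max)` box format, and the 0.5 threshold are all illustrative assumptions.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp at zero so disjoint boxes give zero overlap.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


def placement_hit(requested: Box, detected: Box, threshold: float = 0.5) -> bool:
    """Hypothetical success criterion: the detected box overlaps the
    requested box with IoU at or above the (assumed) threshold."""
    return iou(requested, detected) >= threshold
```

Under these assumptions, averaging `placement_hit` over many prompt and box pairs would give a single score for control capability; the actual metric in the paper may differ in both the overlap measure and the aggregation.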