Text-driven image editing with diffusion models has recently shown remarkable success. However, existing methods assume that the user's description sufficiently grounds the context of the source image, such as its objects, background, style, and the relations among them. This assumption is unsuitable for real-world applications, because users must manually engineer text prompts to find the optimal description for each image. From the user's standpoint, prompt engineering is labor-intensive, and users prefer to provide a target word for editing rather than a full sentence. To address this problem, we first demonstrate the importance of a detailed text description of the source image by dividing prompts into three categories according to their level of semantic detail. We then propose simple yet effective methods that combine prompt generation frameworks, making the prompt engineering process more user-friendly. Extensive qualitative and quantitative experiments demonstrate the importance of prompts in text-driven image editing and show that our method performs comparably to ground-truth prompts.