Recent advancements in generative models have revolutionized image generation and editing, making these tasks accessible to non-experts. This paper focuses on local image editing, particularly the task of adding new content to a loosely specified area. Existing methods often require a precise mask or a detailed description of the location, which can be cumbersome and error-prone. We propose Click2Mask, a novel approach that simplifies the local editing process by requiring only a single point of reference (in addition to the content description). A mask is dynamically grown around this point during a Blended Latent Diffusion (BLD) process, guided by a masked CLIP-based semantic loss. Click2Mask surpasses the limitations of segmentation-based and fine-tuning-dependent methods, offering a more user-friendly and contextually accurate solution. Our experiments demonstrate that Click2Mask not only minimizes user effort but also delivers competitive or superior local image manipulation results compared to SoTA methods, according to both human judgement and automatic metrics. Key contributions include the simplification of user input, the ability to freely add objects unconstrained by existing segments, and the integration potential of our dynamic mask approach within other editing methods.
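To make the dynamic-mask idea concrete, here is a minimal toy sketch of a mask grown around a single click point over successive steps. It is not the paper's method: Click2Mask grows the mask during Blended Latent Diffusion under a masked CLIP-based semantic loss, whereas this sketch substitutes simple radial growth (the `radius_step` parameter and the `grow_mask` helper are illustrative assumptions) purely to show the evolving-mask mechanism.

```python
import numpy as np

def grow_mask(click, shape, steps, radius_step=2):
    """Toy illustration: grow a circular binary mask around a click point.

    In Click2Mask the growth is guided by a masked CLIP semantic loss
    during the BLD process; here we use fixed radial growth instead,
    only to visualize a mask expanding from a single user click.
    """
    yy, xx = np.mgrid[:shape[0], :shape[1]]
    dist = np.sqrt((yy - click[0]) ** 2 + (xx - click[1]) ** 2)
    # One binary mask per step; radius increases each step.
    return [dist <= t * radius_step for t in range(1, steps + 1)]

masks = grow_mask(click=(32, 32), shape=(64, 64), steps=5)
areas = [int(m.sum()) for m in masks]
# The mask always covers the click point and its area never shrinks.
assert masks[0][32, 32]
assert areas == sorted(areas)
```

A real implementation would score candidate mask updates against the target text prompt (e.g., via a CLIP similarity restricted to the masked region) and accept growth that improves semantic alignment, rather than growing unconditionally.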