Image editing affords increased control over the aesthetics and content of generated images. Existing works focus predominantly on text-based instructions to achieve the desired image modifications, which limits edit precision and accuracy. In this work, we propose an inference-time editing optimisation designed to extend beyond textual edits and accommodate multiple editing instruction types (e.g. spatial layout-based conditions such as pose, scribbles and edge maps). We disentangle the editing task into two competing subtasks, successful local image modification and global content consistency preservation, each guided by a dedicated loss function. By allowing the influence of each loss function to be adjusted, we build a flexible editing solution that can be tuned to user preferences. We evaluate our method using text, pose and scribble edit conditions, and demonstrate its ability to achieve complex edits through both qualitative and quantitative experiments.
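The trade-off between the two subtasks can be illustrated with a toy sketch. The abstract does not specify the form of either loss, so everything below is an assumption made for illustration: the squared-error terms, the binary edit mask, the plain gradient descent loop, and all function and parameter names are hypothetical. The sketch also operates on a raw array rather than on the latents of a generative model.

```python
import numpy as np

def edit_loss(x, target, mask):
    # Assumed local term: squared error to the edit target inside the mask.
    return float(np.sum(mask * (x - target) ** 2))

def preservation_loss(x, source):
    # Assumed global term: squared error to the source over the whole image.
    return float(np.sum((x - source) ** 2))

def optimise(source, target, mask, w_edit=1.0, w_preserve=1.0,
             lr=0.05, steps=200):
    """Minimise w_edit * edit_loss + w_preserve * preservation_loss.

    The step size must stay small relative to the weights
    (lr * 2 * (w_edit + w_preserve) < 2), otherwise the updates diverge.
    """
    x = source.astype(float).copy()
    for _ in range(steps):
        # Analytic gradient of the weighted sum of the two squared-error terms.
        grad = (2.0 * w_edit * mask * (x - target)
                + 2.0 * w_preserve * (x - source))
        x -= lr * grad
    return x

# Toy 4x4 "image": edit the central 2x2 block from 0 towards 1.
source = np.zeros((4, 4))
target = np.ones((4, 4))
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0

edit_heavy = optimise(source, target, mask, w_edit=3.0, w_preserve=1.0)
preserve_heavy = optimise(source, target, mask, w_edit=1.0, w_preserve=3.0)
```

Under these assumptions, pixels outside the mask stay at their source values, while pixels inside the mask settle at a weighted compromise between source and target. Raising `w_edit` pulls the masked region towards the target, and raising `w_preserve` pulls it back towards the source, which mirrors the user-adjustable balance described above.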