Conditional diffusion models have demonstrated impressive performance on various tasks such as text-guided semantic image editing. Prior work either requires human users to identify image regions manually or relies on an object detector that performs well only for object-centric manipulations. For example, if an input image contains multiple objects with the same semantic meaning (such as a group of birds), object detectors may struggle to recognize and localize the target object, let alone accurately manipulate it. To address these challenges, we propose a two-stage method for complex scene image editing by Scene Graph Comprehension (SGC-Net). In the first stage, we train a Region of Interest (RoI) prediction network that uses scene graphs to predict the locations of the target objects. Unlike object detection methods based solely on object category, our method can accurately recognize the target object by comprehending the objects and their semantic relationships within a complex scene. The second stage uses a conditional diffusion model to edit the image based on our RoI predictions. We evaluate our approach on the CLEVR and Visual Genome datasets. We report an 8-point improvement in SSIM on CLEVR, and on Visual Genome our edited images were preferred by human users by 9-33% over prior work, validating the effectiveness of our proposed method. Code is available at github.com/Zhongping-Zhang/SGC_Net.
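The ambiguity described above (several objects sharing one category) can be illustrated with a toy example. This is only a minimal sketch and not the SGC-Net RoI prediction network, which is learned: here a hand-written lookup over scene-graph triples stands in for it. The `SceneObject` class, the helper functions, and all boxes and relations are hypothetical values made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    obj_id: int
    category: str
    box: tuple  # hypothetical (x_min, y_min, x_max, y_max) in pixels


def select_by_category(objects, category):
    """Category-only selection: returns every object of that category."""
    return [o for o in objects if o.category == category]


def select_by_relation(objects, relations, subject_cat, predicate, object_cat):
    """Selection using (subject_id, predicate, object_id) scene-graph triples."""
    by_id = {o.obj_id: o for o in objects}
    return [
        by_id[s]
        for (s, p, t) in relations
        if p == predicate
        and by_id[s].category == subject_cat
        and by_id[t].category == object_cat
    ]


# Made-up scene: three birds, two on a branch and one on a fence.
objects = [
    SceneObject(0, "bird", (10, 40, 60, 90)),
    SceneObject(1, "bird", (120, 35, 170, 85)),
    SceneObject(2, "bird", (200, 150, 250, 200)),
    SceneObject(3, "branch", (0, 85, 180, 100)),
    SceneObject(4, "fence", (190, 195, 300, 260)),
]
relations = [(0, "on", 3), (1, "on", 3), (2, "on", 4)]

category_hits = select_by_category(objects, "bird")
relation_hits = select_by_relation(objects, relations, "bird", "on", "fence")

print(len(category_hits))    # category alone leaves three candidates
print(relation_hits[0].box)  # the relation narrows it to a single region
```

In this sketch, querying by category alone returns all three birds, while the triple "bird on fence" isolates a single box that a second-stage editor could then treat as the region to modify.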