State-of-the-art text-based image editing models often struggle to balance background preservation with semantic consistency, frequently resulting either in the synthesis of entirely new images or in outputs that fail to realize the intended edits. In contrast, scene graph-based image editing addresses this limitation by providing a structured representation of semantic entities and their relations, thereby offering improved controllability. However, existing scene graph editing methods typically depend on model fine-tuning, which incurs high computational cost and limits scalability. To address this, we introduce VENUS (Visual Editing with Noise inversion Using Scene graphs), a training-free framework for scene graph-guided image editing. Specifically, VENUS employs a split prompt conditioning strategy that disentangles the target object of the edit from its background context, while simultaneously leveraging noise inversion to preserve fidelity in unedited regions. Moreover, our proposed approach integrates scene graphs extracted from multimodal large language models with diffusion backbones, without requiring any additional training. Empirically, VENUS substantially improves both background preservation and semantic alignment on PIE-Bench, increasing PSNR from 22.45 to 24.80 and SSIM from 0.79 to 0.84, and reducing LPIPS from 0.100 to 0.070 relative to the state-of-the-art scene graph editing model (SGEdit). In addition, VENUS enhances semantic consistency as measured by CLIP similarity (24.97 vs. 24.19). On EditVal, VENUS achieves the highest fidelity with a 0.87 DINO score and, crucially, reduces per-image runtime from 6-10 minutes to only 20-30 seconds. Beyond scene graph-based editing, VENUS also surpasses strong text-based editing baselines such as LEDIT++ and P2P+DirInv, thereby demonstrating consistent improvements across both paradigms.
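To make the two core ideas named above concrete, the following is a minimal sketch (not the authors' implementation) of DDIM-style noise inversion combined with split-prompt conditioning on a generic Stable Diffusion backbone via the diffusers library. The checkpoint id, the background/edit prompts, and the object mask (which VENUS would derive from an MLLM-extracted scene graph) are hypothetical placeholders.

```python
# Sketch of noise inversion + split-prompt denoising; assumes a Stable Diffusion
# backbone from `diffusers`. Prompts, mask, and checkpoint id are placeholders.
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

device = "cuda" if torch.cuda.is_available() else "cpu"
# Placeholder checkpoint; any SD text-to-image checkpoint with a DDIM scheduler works.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.scheduler.set_timesteps(50)
alphas = pipe.scheduler.alphas_cumprod.to(device)


@torch.no_grad()
def embed(prompt: str) -> torch.Tensor:
    """Encode a text prompt with the pipeline's CLIP text encoder."""
    tok = pipe.tokenizer(prompt, padding="max_length",
                         max_length=pipe.tokenizer.model_max_length,
                         truncation=True, return_tensors="pt").to(device)
    return pipe.text_encoder(tok.input_ids)[0]


@torch.no_grad()
def ddim_invert(latent: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
    """Run the deterministic DDIM update backwards (clean latent -> noise)
    so that re-denoising can reproduce the unedited regions faithfully."""
    timesteps = list(reversed(pipe.scheduler.timesteps))
    for i, t in enumerate(timesteps):
        a_t = alphas[t]
        a_prev = alphas[timesteps[i - 1]] if i > 0 else torch.tensor(1.0, device=device)
        eps = pipe.unet(latent, t, encoder_hidden_states=context).sample
        x0 = (latent - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()
        latent = a_t.sqrt() * x0 + (1 - a_t).sqrt() * eps
    return latent


@torch.no_grad()
def split_prompt_edit(latent_T, ctx_background, ctx_edit, mask):
    """Denoise the inverted noise twice, once per prompt, and blend the two
    trajectories with a latent-resolution object mask (hypothetical input)."""
    lat_bg, lat_edit = latent_T.clone(), latent_T.clone()
    for t in pipe.scheduler.timesteps:
        for lat, ctx in ((lat_bg, ctx_background), (lat_edit, ctx_edit)):
            eps = pipe.unet(lat, t, encoder_hidden_states=ctx).sample
            lat.copy_(pipe.scheduler.step(eps, t, lat).prev_sample)
        # Outside the edited region, keep the background trajectory so the
        # inverted noise preserves the original content.
        lat_edit.copy_(mask * lat_edit + (1 - mask) * lat_bg)
    return lat_edit
```

In practice the source image would first be VAE-encoded to obtain the initial latent, inverted with the background/context prompt via `ddim_invert`, and then re-denoised with `split_prompt_edit`, where the mask localizes the object targeted by the scene-graph edit; this sketch omits the scene-graph extraction and prompt-splitting steps themselves.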