Image editing has advanced significantly with the development of diffusion models, through both inversion-based and instruction-based methods. However, current inversion-based approaches struggle with large modifications (e.g., adding or removing objects) because the structured nature of inversion noise hinders substantial changes. Meanwhile, instruction-based methods often confine users to black-box operations, limiting direct interaction for specifying editing regions and intensity. To address these limitations, we propose BrushEdit, a novel inpainting-based, instruction-guided image editing paradigm that leverages multimodal large language models (MLLMs) and image inpainting models to enable autonomous, user-friendly, and interactive free-form instruction editing. Specifically, we devise a system that integrates MLLMs and a dual-branch image inpainting model in an agent-cooperative framework to perform editing category classification, main object identification, mask acquisition, and editing area inpainting. Extensive experiments show that our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics, including mask region preservation and editing effect coherence.
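The four stages named above (editing category classification, main object identification, mask acquisition, and editing area inpainting) can be read as a simple sequential pipeline. The following is a minimal sketch of that control flow only, not the BrushEdit implementation: `run_edit`, `EditPlan`, the `toy_*` stand-ins, and the category labels are all hypothetical names introduced for illustration, and the final compositing step (keeping original pixels wherever the mask is 0) is an assumption about how unedited regions might be preserved. In a real system the four callables would presumably be backed by an MLLM, a segmentation or detection model, and an inpainting model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]  # toy grayscale image, one int per pixel
Mask = List[List[int]]   # 1 = pixel may be edited, 0 = pixel must be preserved


@dataclass
class EditPlan:
    category: str  # hypothetical label such as "add" or "remove"
    target: str    # main object the instruction refers to
    mask: Mask     # region selected for inpainting


def run_edit(
    image: Image,
    instruction: str,
    classify: Callable[[str], str],
    find_target: Callable[[str], str],
    get_mask: Callable[[Image, str], Mask],
    inpaint: Callable[[Image, Mask, str], Image],
) -> Tuple[EditPlan, Image]:
    """Run the four stages in order and composite the result."""
    category = classify(instruction)            # 1. editing category classification
    target = find_target(instruction)           # 2. main object identification
    mask = get_mask(image, target)              # 3. mask acquisition
    edited = inpaint(image, mask, instruction)  # 4. editing area inpainting
    # Assumption: outside the mask, the original pixels are copied back unchanged.
    blended = [
        [e if m else o for o, e, m in zip(o_row, e_row, m_row)]
        for o_row, e_row, m_row in zip(image, edited, mask)
    ]
    return EditPlan(category, target, mask), blended


# Toy stand-ins so the sketch runs without any model (all hypothetical).
def toy_classify(instruction: str) -> str:
    return "remove" if "remove" in instruction.lower() else "add"


def toy_find_target(instruction: str) -> str:
    return instruction.split()[-1]  # naive: treat the last word as the object


def toy_get_mask(image: Image, target: str) -> Mask:
    # Pretend the target occupies the last column of every row.
    return [[0] * (len(row) - 1) + [1] for row in image]


def toy_inpaint(image: Image, mask: Mask, instruction: str) -> Image:
    return [[9 for _ in row] for row in image]  # fill everything with a constant


plan, result = run_edit(
    [[1, 2], [3, 4]], "remove the cat",
    toy_classify, toy_find_target, toy_get_mask, toy_inpaint,
)
print(plan.category, plan.target, result)  # remove cat [[1, 9], [3, 9]]
```

Passing the stages in as callables keeps the sketch independent of any particular model, and because the mask is an explicit value between stages 3 and 4, a user-supplied mask could replace the automatically acquired one, which is one way to picture the interactive control over editing regions that the abstract describes.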