Instruction-based image editing improves the controllability and flexibility of image manipulation via natural-language commands, without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation via LMs. We investigate how MLLMs can facilitate edit instructions and present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive instructions and provides explicit guidance. The editing model jointly captures this visual imagination and performs manipulation through end-to-end training. We evaluate various aspects of Photoshop-style modification, global photo optimization, and local editing. Extensive experimental results demonstrate that expressive instructions are crucial to instruction-based image editing, and that MGIE leads to a notable improvement in automatic metrics and human evaluation while maintaining competitive inference efficiency.