Instruction-based image editing improves the controllability and flexibility of image manipulation by letting users issue natural-language commands instead of elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and in visually aware response generation via language models (LMs). We investigate how MLLMs can facilitate edit instructions and present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive instructions and provides explicit guidance to the editing model, which jointly captures this visual imagination and performs the manipulation through end-to-end training. We evaluate various aspects of editing: Photoshop-style modification, global photo optimization, and local editing. Extensive experimental results demonstrate that expressive instructions are crucial to instruction-based image editing, and that MGIE yields a notable improvement in both automatic metrics and human evaluation while maintaining competitive inference efficiency.