Recent advances in multimodal large language models (MLLMs) have shown great potential for extending vision-language reasoning to professional tool-based image editing, enabling intuitive and creative editing workflows. A promising direction is to use reinforcement learning (RL) to teach MLLMs to reason about and execute optimal tool-use plans within professional image-editing software. However, training remains challenging due to the lack of reliable, verifiable reward signals that reflect the inherently subjective nature of creative editing. In this work, we introduce RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a generalist reward model. RetouchIQ interprets user-specified editing intentions and generates corresponding executable image adjustments, bridging high-level aesthetic goals and precise parameter control. To move beyond conventional rule-based rewards, which compute similarity against a fixed reference image using handcrafted metrics, we propose a generalist reward model: an RL-fine-tuned MLLM that evaluates edited results against a set of metrics it generates on a case-by-case basis. The reward model then provides scalar feedback through multimodal reasoning, enabling reinforcement learning with high-quality, instruction-consistent gradients. We curate an extended dataset of 190k instruction-reasoning pairs and establish a new benchmark for instruction-based image editing. Experiments show that RetouchIQ substantially improves both semantic consistency and perceptual quality over prior MLLM-based and diffusion-based editing systems. Our findings demonstrate the potential of generalist reward-driven MLLM agents as flexible, explainable, and executable assistants for professional image editing.
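
To make the training loop the abstract describes concrete, the following minimal Python sketch illustrates one plausible shape of it: the policy MLLM samples candidate tool-use plans, each plan is executed, and the generalist reward model scores the result to drive a policy-gradient update. This is purely illustrative; the function names (`policy_propose`, `generalist_reward`, `execute_plan`), the stub implementations, and the group-relative advantage scheme are all assumptions, not the paper's actual API or algorithm.

```python
"""Illustrative sketch of a generalist-reward-guided RL loop.
All names and the group-relative advantage scheme are assumptions
made for illustration; the actual RetouchIQ design may differ."""

import random
import statistics


def policy_propose(instruction, image, n_samples=4):
    # Placeholder for the MLLM agent: sample n candidate tool-use plans
    # (executable parameter adjustments) for a single instruction.
    return [{"exposure": random.uniform(-1, 1),
             "contrast": random.uniform(-1, 1)} for _ in range(n_samples)]


def execute_plan(image, plan):
    # Placeholder for applying the plan inside the editing software.
    return {"image": image, "applied": plan}


def generalist_reward(instruction, original, edited):
    # Placeholder for the RL-fine-tuned reward MLLM: it would generate
    # case-specific metrics via multimodal reasoning, score each one,
    # and aggregate them into a single scalar reward.
    metrics = {"instruction_adherence": random.random(),
               "perceptual_quality": random.random(),
               "semantic_consistency": random.random()}
    return sum(metrics.values()) / len(metrics), metrics


def rl_step(instruction, image):
    plans = policy_propose(instruction, image)
    rewards = [generalist_reward(instruction, image, execute_plan(image, p))[0]
               for p in plans]
    # Group-relative advantages (an assumed GRPO-style baseline): each
    # sampled plan is credited by how much it beats its siblings.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    advantages = [(r - mean) / std for r in rewards]
    return list(zip(plans, advantages))  # fed to the policy-gradient update


if __name__ == "__main__":
    for plan, adv in rl_step("warm, cinematic look", image="photo.raw"):
        print(f"advantage={adv:+.2f}  plan={plan}")
```

The key design point the sketch tries to capture is that the reward is not a fixed handcrafted metric: the scalar comes from metrics the reward model itself proposes per case, so the same loop can supervise very different editing intents.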