Evaluating image editing models remains challenging due to the coarse granularity and limited interpretability of traditional metrics, which often fail to capture aspects important to human perception and intent. Such metrics frequently reward visually plausible outputs while overlooking controllability, edit localization, and faithfulness to user instructions. In this work, we introduce a fine-grained Multimodal Large Language Model (MLLM)-as-a-Judge framework for image editing that decomposes common evaluation notions into twelve interpretable factors spanning image preservation, edit quality, and instruction fidelity. Building on this formulation, we present a new human-validated benchmark that integrates human judgments, MLLM-based evaluations, model outputs, and traditional metrics across diverse image editing tasks. Through extensive human studies, we show that the proposed MLLM judges align closely with human evaluations at a fine granularity, supporting their use as reliable and scalable evaluators. We further demonstrate that traditional image editing metrics are often poor proxies for these factors, failing to distinguish over-edited or semantically imprecise outputs, whereas our judges provide more intuitive and informative assessments in both offline and online settings. Together, this work introduces a benchmark, a principled factorization, and empirical evidence positioning fine-grained MLLM judges as a practical foundation for studying, comparing, and improving image editing approaches.
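To make the factorized evaluation concrete, the sketch below shows one way per-factor judge scores could be stored and aggregated along the three axes named above. The factor names, the `JudgeVerdict` class, and the unweighted averaging are illustrative assumptions for exposition, not the paper's actual twelve factors or scoring protocol.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical factor names grouped by the three evaluation axes;
# the paper's actual twelve factors are not reproduced here.
FACTOR_GROUPS = {
    "image_preservation": ["background_consistency", "identity_preservation", "unintended_change"],
    "edit_quality": ["edit_realism", "artifact_level", "blending_quality"],
    "instruction_fidelity": ["instruction_compliance", "edit_localization", "semantic_precision"],
}


@dataclass
class JudgeVerdict:
    """Per-factor scores (e.g., on a 1-5 scale) returned by an MLLM judge for one edited image."""
    scores: dict[str, float]

    def group_score(self, group: str) -> float:
        # Average the factor scores belonging to one evaluation axis.
        return mean(self.scores[f] for f in FACTOR_GROUPS[group])

    def overall(self) -> float:
        # Unweighted mean over the three axes; a real protocol may weight axes differently.
        return mean(self.group_score(g) for g in FACTOR_GROUPS)


if __name__ == "__main__":
    # Toy verdict with uniform scores, just to show the aggregation.
    verdict = JudgeVerdict(scores={f: 4.0 for fs in FACTOR_GROUPS.values() for f in fs})
    for group in FACTOR_GROUPS:
        print(group, verdict.group_score(group))
    print("overall", verdict.overall())
```

Keeping the per-factor scores (rather than only an aggregate) is what allows diagnosing failures such as over-editing or imprecise localization separately from overall visual quality.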