An Extensive Benchmark for Single-round and Multi-round Instruction-based Image Editing

Instruction-based image editing (IIE), in which a model automatically modifies an input image according to a given instruction, has advanced notably in recent years. Evaluating these models nevertheless remains difficult because instructions are intricate and edits vary widely. The field therefore urgently needs a robust evaluation framework that accurately measures the quality of editing results and provides benchmarks to guide future improvements. To this end, we present I2EBench2.0, a comprehensive benchmark for single-round and multi-round evaluation of IIE models. I2EBench2.0 has four key features: 1) Single-round and Multi-round Evaluation: I2EBench2.0 evaluates both single-round and multi-round instruction-based edits, assessing the precision and consistency of the edits. 2) Extensive Evaluation Criteria: I2EBench2.0 covers both high-level and low-level aspects of each IIE model, with 16 dimensions for single-round evaluation and 7 for multi-round evaluation. 3) Alignment with Human Judgment: To ensure that the benchmark agrees with human evaluation, we conducted a comprehensive user study for each criterion. 4) Research-driven Insights: By analyzing the strengths and weaknesses of current IIE models across all 16 single-round and 7 multi-round dimensions, we provide insights intended to direct future research in this area. We tested eight recently developed IIE models with I2EBench2.0 and derived these insights through detailed comparison and analysis. The code, dataset, and images generated by all IIE models are available on GitHub: https://github.com/cocoshe/I2EBench.
