ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks

Samin Mahdizadeh Sani, Max Ku, Nima Jamali, Matina Mahdizadeh Sani, Paria Khoshtab, Wei-Chieh Sun, Parnian Fazel, Zhi Rui Tam, Thomas Chong, Edisy Kin Wai Chan, Donald Wai Tong Tsang, Chiao-Wei Hsu, Ting Wai Lam, Ho Yin Sam Ng, Chiafeng Chu, Chak-Wing Mak, Keming Wu, Hiu Tung Wong, Yik Chun Ho, Chi Ruan, Zhuofeng Li, I-Sheng Fang, Shih-Ying Yeh, Ho Kei Cheng, Ping Nie, Wenhu Chen

From arXiv; published in ICLR 2026

Advances in diffusion, autoregressive, and hybrid models have enabled high-quality image synthesis for tasks such as text-to-image generation, editing, and reference-guided composition. Yet existing benchmarks remain limited: they either focus on isolated tasks, cover only narrow domains, or provide opaque scores without explaining failure modes. We introduce ImagenWorld, a benchmark of 3.6K condition sets spanning six core tasks (generation and editing, with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots). The benchmark is supported by 20K fine-grained human annotations and an explainable evaluation schema that tags localized object-level and segment-level errors, complementing automated VLM-based metrics. Our large-scale evaluation of 14 models yields several insights: (1) models typically struggle more with editing tasks than with generation tasks, especially local edits; (2) models excel in artistic and photorealistic settings but struggle in symbolic and text-heavy domains such as screenshots and information graphics; (3) closed-source systems lead overall, while targeted data curation (e.g., Qwen-Image) narrows the gap in text-heavy cases; and (4) modern VLM-based metrics achieve Kendall accuracies up to 0.79, approximating human ranking, but fall short of fine-grained, explainable error attribution. ImagenWorld provides both a rigorous benchmark and a diagnostic tool to advance robust image generation.
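
The abstract does not define "Kendall accuracy", so the following is only a minimal sketch under one common reading: the fraction of item pairs that an automated metric orders the same way as the human scores. The function name `kendall_pairwise_accuracy`, the tie-handling rules, and the toy scores are all assumptions made for illustration; they are not taken from the paper.

```python
from itertools import combinations


def kendall_pairwise_accuracy(human_scores, metric_scores):
    """Fraction of item pairs that the metric orders the same way as humans.

    Assumptions (not from the paper): pairs tied in the human scores are
    skipped, and pairs tied only in the metric scores count as disagreements.
    """
    if len(human_scores) != len(metric_scores):
        raise ValueError("score lists must have the same length")

    concordant = 0
    comparable = 0
    for i, j in combinations(range(len(human_scores)), 2):
        human_diff = human_scores[i] - human_scores[j]
        if human_diff == 0:
            continue  # humans express no preference for this pair
        metric_diff = metric_scores[i] - metric_scores[j]
        comparable += 1
        if human_diff * metric_diff > 0:
            concordant += 1  # same sign: metric agrees with humans

    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical scores for four generated images (illustrative only).
human = [4.5, 3.0, 2.0, 1.0]
metric = [0.9, 0.4, 0.6, 0.1]
accuracy = kendall_pairwise_accuracy(human, metric)
print(accuracy)
```

In this toy example the metric swaps only the second and third images, so 5 of the 6 pairs agree with the human ordering. Under this reading, and when there are no ties, the accuracy relates to Kendall's tau by accuracy = (tau + 1) / 2.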
