While current multimodal models are proficient at open-ended visual editing, executing precise edits that admit a single correct answer remains a major obstacle. To probe this challenge, we introduce PaintBench, a dynamically scalable benchmark targeting 20 fundamental precise visual editing operations across four categories: geometric transformation, structural manipulation, color change, and symbolic reasoning. Procedural generation with configurable complexity yields an effectively infinite, contamination-resistant evaluation suite, and deterministic pixel-level evaluation eliminates reliance on bias-prone judge models. Across 11 image editing models, we find low overall performance: the highest-performing model, a current industry leader, scores only 17.1% mIoU. Task decomposition reveals especially challenging operation types (geometric transformation, most structural manipulation, formula-based color change) as well as model-specific specializations. Fine-grained benchmark diagnostics further show that performance degrades with scene variations in object count, background complexity, color scheme, and edit-region size. To test whether PaintBench scores generalize to applied task performance, we create a procedural, deterministic evaluation for data visualization editing (TinyGrafixBench) and find a strong linear correlation with PaintBench scores ($R^2 = 0.91$, $p < 0.001$). Altogether, PaintBench provides a rigorous foundation for measuring and driving progress in precise multimodal visual editing.
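To make the idea of deterministic pixel-level scoring concrete, the following is a minimal sketch of a mean intersection-over-union (mIoU) computed over changed-pixel masks. It is an illustration under stated assumptions, not the PaintBench implementation: the abstract does not define the exact metric, so the choice to compare the pixels a model changed against the pixels the reference edit changed, and the names `changed_mask`, `iou`, and `mean_iou`, are all hypothetical. This simplified variant checks only where an edit happened, not whether the new pixel values are correct.

```python
from typing import Iterable, List, Sequence, Tuple

# An image is a grid of RGB tuples; this representation is an assumption
# made for the sketch, chosen so the example needs no third-party packages.
Pixel = Tuple[int, int, int]
Image = Sequence[Sequence[Pixel]]
Mask = List[List[bool]]


def changed_mask(before: Image, after: Image) -> Mask:
    """Mark every pixel whose value differs between two same-sized images."""
    return [
        [b != a for b, a in zip(row_before, row_after)]
        for row_before, row_after in zip(before, after)
    ]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two boolean masks of equal size."""
    intersection = 0
    union = 0
    for row_pred, row_target in zip(pred, target):
        for p, t in zip(row_pred, row_target):
            intersection += p and t
            union += p or t
    # Two empty masks agree perfectly (assumed convention for this sketch).
    return 1.0 if union == 0 else intersection / union


def mean_iou(samples: Iterable[Tuple[Image, Image, Image]]) -> float:
    """Average IoU over (source, prediction, target) triples."""
    scores = [
        iou(changed_mask(source, prediction), changed_mask(source, target))
        for source, prediction, target in samples
    ]
    if not scores:
        raise ValueError("mean_iou needs at least one sample")
    return sum(scores) / len(scores)
```

Because the score depends only on pixel comparisons, repeated runs on the same inputs give the same number, which is the property the abstract contrasts with judge-model scoring. A stricter variant could additionally require changed pixels to match the reference values before counting them in the intersection.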