VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents

Hongzhu Yi, Yujia Yang, Yuanxiang Wang, Tong Li, Zhenyu Guan, Tianyu Zong, Jiahuan Chen, Chenxi Bao, Tiankun Yang, Haopeng Jin, Yixuan Yuan, Xinming Wang, Tao Yu, Ruilin Gao, Ruiwen Tao, Haijin Liang, Jin Ma, Jinwen Luo, Yeshani, Xinyu Zuo, Jungang Xu

In recent years, image editing models have made significant progress, enabling users to manipulate visual content flexibly and interactively through natural language instructions. However, dense visual document image editing remains an important yet underexplored research direction. The task involves modifying textual content within images while faithfully preserving the original text style and background context. Existing methods primarily focus on English scenarios and images with relatively sparse text, and thus cannot adequately address dense, structurally complex documents or non-Latin scripts such as Chinese. To bridge this gap, we propose VDE Bench (Visual Doc Edit Bench), a rigorously human-annotated and human-evaluated benchmark specifically designed to assess the performance of image editing models on bilingual Chinese-English and complex visual document editing tasks. The benchmark comprises a high-quality dataset of 942 instruction-based image editing samples, whose seed images are dense Chinese and English text documents, including academic papers, posters, presentation slides, examination materials, and newspapers. Furthermore, we introduce a novel evaluation framework that systematically quantifies editing performance at the OCR parsing level, thereby enabling fine-grained assessment of text modification accuracy. Based on this benchmark, we conduct a comprehensive evaluation of representative image editing models. Human verification demonstrates a high degree of consistency between human judgments and automated evaluation metrics. VDE Bench constitutes the first systematic benchmark for evaluating the performance of image editing models on bilingual dense-text visual documents.
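To make the idea of scoring at the OCR parsing level more concrete, the following is a minimal sketch of one way such a score could be computed. It is not the metric defined by VDE Bench, whose details are not given in this abstract. It assumes that an OCR system has already parsed the edited image into text lines, and that these lines are aligned one-to-one with the expected target lines. The function names `levenshtein`, `text_similarity`, and `score_edit` are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance; insert, delete, and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def text_similarity(ocr_text: str, expected_text: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(ocr_text), len(expected_text))
    if longest == 0:
        return 1.0  # two empty strings are treated as a perfect match
    return 1.0 - levenshtein(ocr_text, expected_text) / longest


def score_edit(ocr_lines: Sequence[str], expected_lines: Sequence[str]) -> float:
    """Mean per-line similarity between OCR output and the expected text.

    Assumption: the two sequences are already aligned line by line.
    """
    if len(ocr_lines) != len(expected_lines):
        raise ValueError("ocr_lines and expected_lines must have the same length")
    if not expected_lines:
        return 1.0
    total = sum(text_similarity(o, e) for o, e in zip(ocr_lines, expected_lines))
    return total / len(expected_lines)


if __name__ == "__main__":
    # Hypothetical example: one English line edited correctly,
    # one Chinese line with a single wrong character.
    expected = ["Deep Learning 2024", "摘要：图像编辑"]
    ocr_out = ["Deep Learning 2024", "摘要：图象编辑"]
    print(round(score_edit(ocr_out, expected), 4))
```

Working at the character level rather than the word level is a deliberate choice in this sketch: Chinese text has no whitespace word boundaries, so a character-level distance treats both languages uniformly. A practical evaluator would also need to verify that regions outside the edit target remain unchanged, which this sketch does not attempt.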
