In recent years, multimodal image editing models have made substantial progress, enabling users to manipulate visual content through natural language in a flexible and interactive manner. Nevertheless, visual document image editing, which modifies textual content within an image while faithfully preserving the original text style and background context, remains an important yet insufficiently explored direction. Existing approaches, including AnyText, GlyphControl, and TextCtrl, focus predominantly on English-language scenarios and documents with relatively sparse text layouts, and therefore fail to adequately handle dense, structurally complex documents or non-Latin scripts such as Chinese. To bridge this gap, we propose the \textbf{V}isual \textbf{D}oc \textbf{E}dit Bench (VDE Bench), a rigorously human-annotated and validated benchmark designed to assess image editing models on multilingual and complex visual document editing tasks. The benchmark comprises a high-quality dataset of text-dense documents in both English and Chinese, spanning academic papers, posters, presentation slides, examination materials, and newspapers. Furthermore, we introduce a decoupled evaluation framework that quantifies editing performance at the OCR parsing level, enabling fine-grained assessment of text modification accuracy. On this benchmark, we conduct a comprehensive evaluation of representative state-of-the-art image editing models. Manual verification shows strong agreement between human judgments and the automated evaluation metrics. To our knowledge, VDE Bench is the first systematic benchmark for evaluating image editing models on multilingual, text-dense visual documents.
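To make the OCR-level evaluation concrete, the following is a minimal, hypothetical sketch of how text modification accuracy could be scored once an OCR engine has extracted text from the edited region: the OCR output is compared against the intended target string via a normalized Levenshtein similarity. The function names (`edit_distance`, `ocr_edit_score`) are illustrative assumptions, not the benchmark's actual metric implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic single-row dynamic-programming Levenshtein distance."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))  # dp[j] holds distance for prefixes a[:i], b[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev caches the diagonal cell dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                       # deletion from a
                dp[j - 1] + 1,                   # insertion into a
                prev + (a[i - 1] != b[j - 1]),   # substitution (or match)
            )
            prev = cur
    return dp[n]


def ocr_edit_score(target: str, ocr_output: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means the OCR-parsed text
    exactly matches the intended edit target."""
    if not target and not ocr_output:
        return 1.0
    dist = edit_distance(target, ocr_output)
    return 1.0 - dist / max(len(target), len(ocr_output))
```

A character-level score like this is script-agnostic, which matters for mixed English/Chinese documents where word-level tokenization is ill-defined; region-level averaging over all edited spans would then yield a document-level accuracy.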