From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering

Xinyi Shang, Yi Tang, Jiacheng Cui, Ahmed Elhagry, Salwa K. Al Khatib, Sondos Mahmoud Bsharat, Jiacheng Liu, Xiaohan Zhao, Jing-Hao Xue, Hao Li, Salman Khan, Zhiqiang Shen

Source: arXiv. Code and data: https://github.com/VILA-Lab/PIXAR (accepted to CVPR 2026 Findings, but not opted in)

Existing tampering detection benchmarks largely rely on object masks, which severely misalign with the true edit signal: many pixels inside a mask are untouched or only trivially modified, while subtle yet consequential edits outside the mask are treated as natural. We reformulate VLM image tampering from coarse region labels to a pixel-grounded, meaning- and language-aware task. First, we introduce a taxonomy spanning edit primitives (replace, remove, splice, inpaint, attribute, colorization, etc.) and the semantic class of the tampered object, linking low-level changes to high-level understanding. Second, we release a new benchmark with per-pixel tamper maps and paired category supervision to evaluate detection and classification within a unified protocol. Third, we propose a training framework and evaluation metrics that quantify pixel-level correctness with localization, to assess predictions and their confidence against the true edit intensity, and that further measure understanding of tamper meaning via semantics-aware classification and natural language descriptions of the predicted regions. We also re-evaluate strong existing segmentation/localization baselines and recent tamper detectors, revealing substantial over- and under-scoring under mask-only metrics and exposing failure modes on micro-edits and off-mask changes. Our framework advances the field from masks to pixels, meanings, and language descriptions, establishing a rigorous standard for tamper localization, semantic classification, and description. Code and benchmark data are available at https://github.com/VILA-Lab/PIXAR.
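
The gap between object masks and the true edit signal can be illustrated with a toy example. The sketch below is not the benchmark's actual pipeline or metric: the helper names `tamper_map` and `iou`, the intensity threshold of 8, and the 4x4 grayscale image are all assumptions chosen for illustration, and plain Python lists are used only to keep the example dependency-free. It builds a per-pixel tamper map by thresholding the absolute difference between an original and an edited image, then scores two hypothetical predictions against both the object mask and the pixel map.

```python
from typing import List

Image = List[List[int]]   # grayscale image, values 0..255
Mask = List[List[bool]]   # True where a pixel counts as tampered


def tamper_map(original: Image, edited: Image, threshold: int = 8) -> Mask:
    """Mark a pixel as tampered when its absolute change exceeds `threshold`.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return [
        [abs(o - e) > threshold for o, e in zip(row_o, row_e)]
        for row_o, row_e in zip(original, edited)
    ]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two boolean maps (1.0 if both are empty)."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += int(p and t)
            union += int(p or t)
    return inter / union if union else 1.0


# Toy 4x4 image with a uniform background.
original: Image = [[100] * 4 for _ in range(4)]
edited: Image = [row[:] for row in original]

edited[0][0] = 180   # strong edit inside the object mask
edited[0][1] = 30    # strong edit inside the object mask
edited[1][0] = 102   # trivial change inside the mask (below threshold)
edited[3][3] = 160   # consequential edit outside the mask

# Coarse object mask: the top-left 2x2 block.
object_mask: Mask = [[r < 2 and c < 2 for c in range(4)] for r in range(4)]
pixel_map: Mask = tamper_map(original, edited)

# Two hypothetical detector outputs.
mask_like_pred = object_mask    # reproduces the coarse object mask
pixel_like_pred = pixel_map     # reproduces the true edited pixels

scores = {
    "mask_pred_vs_mask": iou(mask_like_pred, object_mask),
    "mask_pred_vs_pixels": iou(mask_like_pred, pixel_map),
    "pixel_pred_vs_mask": iou(pixel_like_pred, object_mask),
    "pixel_pred_vs_pixels": iou(pixel_like_pred, pixel_map),
}

for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

In this toy setup, the prediction that copies the object mask scores perfectly against the mask but poorly against the pixel map (over-scoring), while the prediction that matches the real edits is penalized by the mask-based score (under-scoring), including for the off-mask pixel it correctly flags.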
