Existing tampering detection benchmarks largely rely on object masks, which are severely misaligned with the true edit signal: many pixels inside a mask are untouched or only trivially modified, while subtle yet consequential edits outside the mask are treated as natural. We reformulate VLM image tampering detection from coarse region labels to a pixel-grounded, meaning- and language-aware task. First, we introduce a taxonomy spanning edit primitives (replace, remove, splice, inpaint, attribute, colorization, etc.) and the semantic class of the tampered object, linking low-level changes to high-level understanding. Second, we release a new benchmark with per-pixel tamper maps and paired category supervision to evaluate detection and classification within a unified protocol. Third, we propose a training framework and evaluation metrics that quantify pixel-level localization correctness, assessing predictions or confidence scores against the true edit intensity, and that further measure understanding of tamper meaning via semantics-aware classification and natural language descriptions of the predicted regions. We also re-evaluate existing strong segmentation/localization baselines alongside recent strong tamper detectors, revealing substantial over- and under-scoring under mask-only metrics and exposing failure modes on micro-edits and off-mask changes. Our framework advances the field from masks to pixels, meanings, and language descriptions, establishing a rigorous standard for tamper localization, semantic classification, and description. Code and benchmark data are available at https://github.com/VILA-Lab/PIXAR.
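The gap between mask-based and pixel-grounded scoring can be illustrated with a toy example. The sketch below is not the benchmark's actual construction: it assumes, purely for illustration, that a per-pixel tamper map is obtained by thresholding the absolute difference between an original and an edited image, and the names `tamper_map`, `iou`, and the threshold `tau` are hypothetical. It shows how a detector that finds every truly edited pixel is still penalized when scored against a coarse object mask.

```python
import numpy as np


def tamper_map(original, edited, tau=0.05):
    """Hypothetical per-pixel tamper map: True where the mean absolute
    channel difference exceeds tau (an assumed threshold, not from the paper)."""
    diff = np.abs(original.astype(np.float64) - edited.astype(np.float64))
    return diff.mean(axis=-1) > tau


def iou(pred, target):
    """Intersection-over-union between two binary maps."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both maps empty: treat as perfect agreement
    return np.logical_and(pred, target).sum() / union


# Toy 8x8 RGB image pair with values in [0, 1].
original = np.zeros((8, 8, 3))
edited = original.copy()
edited[2:4, 2:4] = 1.0  # 4 genuinely edited pixels inside the object mask
edited[7, 7] = 1.0      # 1 edited pixel outside the object mask

# Coarse object mask covering 16 pixels, most of them untouched.
object_mask = np.zeros((8, 8), dtype=bool)
object_mask[1:5, 1:5] = True

pixel_map = tamper_map(original, edited)
prediction = pixel_map.copy()  # a detector that finds exactly the edited pixels

iou_vs_mask = iou(prediction, object_mask)   # 4 / 17, about 0.235
iou_vs_pixels = iou(prediction, pixel_map)   # 1.0
print(f"IoU vs object mask: {iou_vs_mask:.3f}")
print(f"IoU vs pixel map:   {iou_vs_pixels:.3f}")
```

In this toy setting the same prediction scores about 0.235 against the object mask and 1.0 against the pixel-level map, which mirrors the under-scoring described above; a prediction that simply filled the whole mask would be over-scored in the opposite way.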