Non-semantic features, also called semantic-agnostic features, are irrelevant to image content but sensitive to image manipulations, and are therefore recognized as evidence for Image Manipulation Localization (IML). Since manually labeling non-semantic features is infeasible, existing works rely on handcrafted methods to extract them. Handcrafted non-semantic features, however, jeopardize an IML model's generalization ability in unseen or complex scenarios. Therefore, the elephant in the room for IML is: how can non-semantic features be extracted adaptively? Non-semantic features are context-irrelevant and manipulation-sensitive; that is, within an image, they are consistent across patches unless manipulation occurs. Sparse, discrete interactions among image patches are therefore sufficient for extracting non-semantic features. In contrast, image semantics vary drastically across patches, so learning semantic representations requires dense, continuous interactions among patches. Hence, in this paper, we propose a Sparse Vision Transformer (SparseViT), which reformulates the dense, global self-attention in ViT into a sparse, discrete form. Such sparse self-attention breaks image semantics and forces SparseViT to adaptively extract non-semantic features. Moreover, compared with existing IML models, the sparse self-attention mechanism substantially reduces model cost (up to 80% fewer FLOPs), achieving striking parameter efficiency and computational savings. Extensive experiments demonstrate that, without any handcrafted feature extractors, SparseViT is superior in both generalization and efficiency across benchmark datasets.
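The contrast between dense and sparse self-attention can be illustrated with a minimal sketch. The snippet below is not the paper's implementation: the strided grouping pattern, group count, and single-head formulation are illustrative assumptions. It only shows the core idea, that each patch attends to a sparse, discrete subset of patches rather than to all of them, which shrinks the attention computation from one $N \times N$ score matrix to several much smaller ones.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dense_attention(x):
    # x: (N, d) patch embeddings; every patch attends to every patch (O(N^2)).
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def sparse_attention(x, stride=4):
    # Hypothetical sparsity pattern: each patch attends only to patches in
    # its strided group, breaking the dense semantic interactions.
    # With `stride` groups of N/stride patches each, the score computation
    # drops from N^2 to roughly N^2 / stride entries.
    out = np.empty_like(x)
    for g in range(stride):
        idx = np.arange(g, x.shape[0], stride)
        out[idx] = dense_attention(x[idx])
    return out

x = np.random.default_rng(0).normal(size=(16, 8))  # 16 patches, dim 8
y = sparse_attention(x)
```

Within each group, the computation is ordinary scaled dot-product attention; sparsity comes purely from which patches are allowed to interact, which is what limits the model to the patch-consistent, non-semantic signal.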