The emergence of artificial intelligence-generated content (AIGC) has raised concerns about the authenticity of multimedia content in various fields. However, existing research on forged content detection has focused mainly on binary classification of complete videos, an approach with limited applicability in industrial settings. To address this gap, we propose UMMAFormer, a novel universal transformer framework for temporal forgery localization (TFL) that predicts forgery segments with multimodal adaptation. Our approach introduces a Temporal Feature Abnormal Attention (TFAA) module, based on temporal feature reconstruction, to enhance the detection of temporal differences. We also design a Parallel Cross-Attention Feature Pyramid Network (PCA-FPN) that optimizes the Feature Pyramid Network (FPN) for subtle feature enhancement. To evaluate the proposed method, we contribute a novel Temporal Video Inpainting Localization (TVIL) dataset specifically tailored for video inpainting scenes. Our experiments show that our approach achieves state-of-the-art performance on benchmark datasets, including Lav-DF, TVIL, and Psynd, significantly outperforming previous methods. The code and data are available at https://github.com/ymhzyj/UMMAFormer/.