CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection

The proliferation of advanced AI video synthesis techniques poses an unprecedented challenge to digital video authenticity. Existing AI-generated video (AIGV) detection methods focus primarily on uni-modal or spatiotemporal artifacts and overlook the rich cues in the visual-textual cross-modal space, especially the temporal stability of semantic alignment. In this work, we identify a distinctive fingerprint of AIGVs, termed the cross-modal temporal artifact (CMTA). Real videos exhibit natural temporal fluctuations in cross-modal alignment due to semantic variations, whereas AIGVs display unnaturally stable semantic trajectories governed by their input prompts. To exploit this observation, we propose the CMTA framework, a cross-modal detection approach that captures these temporal artifacts through joint cross-modal embedding and multi-grained temporal modeling. Specifically, CMTA uses BLIP to generate frame-level image captions and CLIP to extract the corresponding visual-textual representations. A coarse-grained temporal modeling branch then characterizes temporal fluctuations in cross-modal alignment with a GRU. In parallel, a fine-grained branch captures intricate inter-frame variations from the integrated visual-textual features with a Transformer encoder. Extensive experiments on 40 subsets across four large-scale datasets (GenVideo, EvalCrafter, VideoPhy, and VidProM) show that our approach sets a new state of the art while exhibiting superior cross-generator generalization. Code and models of CMTA will be released at https://github.com/hwang-cs-ime/CMTA
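The cue described above, how strongly frame-caption alignment fluctuates over time, can be illustrated with a minimal sketch. This is not the CMTA model. It assumes that per-frame visual and caption embeddings have already been extracted (in the paper, by CLIP from frames and BLIP captions), and it replaces them with synthetic random arrays. The function names `alignment_trajectory` and `fluctuation_stats`, the two summary statistics, and the toy clips are all hypothetical choices made for illustration.

```python
import numpy as np


def alignment_trajectory(visual, textual):
    """Per-frame cosine similarity between visual and caption embeddings.

    Both inputs have shape (T, D); the result has shape (T,).
    """
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = textual / np.linalg.norm(textual, axis=1, keepdims=True)
    return np.sum(v * t, axis=1)


def fluctuation_stats(traj):
    """Simple hand-crafted summaries of how much alignment varies over time."""
    return {
        "std": float(np.std(traj)),
        "mean_abs_diff": float(np.mean(np.abs(np.diff(traj)))),
    }


# Synthetic stand-ins for embeddings (assumption: real ones would come from
# a vision-language encoder applied to frames and their captions).
rng = np.random.default_rng(0)
T, D = 16, 512
captions = rng.normal(size=(T, D))

# Toy clip with stable alignment: visual features track captions closely.
stable_visual = captions + 0.1 * rng.normal(size=(T, D))

# Toy clip with fluctuating alignment: the noise level changes per frame.
noise_scale = rng.uniform(0.1, 2.0, size=(T, 1))
varied_visual = captions + noise_scale * rng.normal(size=(T, D))

stable_traj = alignment_trajectory(stable_visual, captions)
varied_traj = alignment_trajectory(varied_visual, captions)

stable_stats = fluctuation_stats(stable_traj)
varied_stats = fluctuation_stats(varied_traj)
print("stable clip:", stable_stats)
print("varied clip:", varied_stats)
```

In the framework described in the abstract, the alignment sequence is modeled by a learned GRU branch rather than by fixed statistics such as these, and a separate Transformer branch operates on the fused features; the sketch only shows what kind of signal a coarse-grained branch would receive.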
