Recent advancements in diffusion-based video generation have showcased remarkable results, yet the gap between synthetic and real-world videos remains under-explored. In this study, we examine this gap from three fundamental perspectives: appearance, motion, and geometry, comparing real-world videos with those generated by a state-of-the-art AI model, Stable Video Diffusion. To achieve this, we train three classifiers using 3D convolutional networks, each targeting a distinct aspect: vision foundation model features for appearance, optical flow for motion, and monocular depth for geometry. Each classifier exhibits strong performance in fake video detection, both qualitatively and quantitatively. This indicates that AI-generated videos are still easily detectable and that a significant gap between real and fake videos persists. Furthermore, using Grad-CAM, we pinpoint systematic failures of AI-generated videos in appearance, motion, and geometry. Finally, we propose an Ensemble-of-Experts model that integrates appearance, optical flow, and depth information for fake video detection, resulting in enhanced robustness and generalization ability. Our model detects videos generated by Sora with high accuracy, even without exposure to any Sora videos during training. This suggests that the gap between real and fake videos generalizes across video generative models. Project page: https://justin-crchang.github.io/3DCNNDetection.github.io/
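The three-expert design described above can be sketched as follows. This is a minimal illustrative PyTorch sketch, not the paper's implementation: the layer sizes, the per-cue channel counts (3 for appearance features, 2 for optical flow, 1 for depth), and the logit-averaging fusion rule are all assumptions for illustration.

```python
import torch
import torch.nn as nn


class Expert3DCNN(nn.Module):
    """A small 3D-CNN binary classifier over a video-shaped input volume.

    The channel count differs per cue (assumed here): C=3 for appearance
    features, C=2 for optical flow (u, v), C=1 for monocular depth.
    """

    def __init__(self, in_channels: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # global pooling over (T, H, W)
        )
        self.head = nn.Linear(32, 1)  # one logit: real vs. fake

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        z = self.features(x).flatten(1)
        return self.head(z)


class EnsembleOfExperts(nn.Module):
    """Fuses the three per-cue experts; averaging logits is an assumed
    fusion rule, the paper's actual scheme may differ."""

    def __init__(self):
        super().__init__()
        self.appearance = Expert3DCNN(in_channels=3)
        self.motion = Expert3DCNN(in_channels=2)
        self.geometry = Expert3DCNN(in_channels=1)

    def forward(self, rgb, flow, depth) -> torch.Tensor:
        logit = (
            self.appearance(rgb) + self.motion(flow) + self.geometry(depth)
        ) / 3
        return torch.sigmoid(logit)  # probability that the clip is fake
```

Keeping each cue in its own branch means a single expert can still be run (and inspected with Grad-CAM) in isolation, which matches the per-cue analysis described above.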