With the rapid advancement of video generation models, distinguishing AI-generated videos from authentic ones has become a challenging problem. Most existing work focuses on detectors for samples produced by generative adversarial networks (GANs). The detection of AI-generated videos, particularly those produced by text-to-video models, remains largely unexplored. Although state-of-the-art text-to-video models can generate realistic visual content that resembles real videos, they struggle to reproduce fine image details and the way those details change across frames. Motivated by this observation, we address AI-generated video detection from the novel perspective of bit-planes, which can effectively describe the details or noise in images and videos. To this end, we propose a simple yet effective approach called Noise Amplification. The approach first extracts noise signals based on bit-planes, then amplifies these signals, and finally feeds them into discriminator networks for real-versus-fake video classification. Noise Amplification combines three components: pixel-level intensity enhancement, region-level spatial amplification, and frame-level temporal aggregation. To evaluate AI-generated video detection methods in challenging scenarios, we also introduce a benchmark named HardGVD. Extensive experiments on both the large-scale dataset GenVidBench and HardGVD show that our simple approach significantly outperforms state-of-the-art methods.
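As a rough illustration of the pipeline described in the abstract, the following NumPy sketch extracts the low-order bit-planes of a clip and applies one possible reading of each of the three amplification components. The abstract does not specify the operations, so everything here is an assumption: the function names (`low_bit_planes`, `amplify_intensity`, `upsample_region`, `aggregate_frames`), the choice of the two least significant bits, nearest-neighbour upsampling for the spatial step, and a temporal mean for the aggregation step are all hypothetical and should not be read as the authors' implementation.

```python
import numpy as np


def low_bit_planes(frames: np.ndarray, num_planes: int = 2) -> np.ndarray:
    """Keep only the `num_planes` least significant bits of each uint8 pixel.

    Assumption: the "noise signal" lives in the low-order bit-planes.
    `frames` has shape (T, H, W) or (T, H, W, C).
    """
    if frames.dtype != np.uint8:
        raise TypeError("expected uint8 frames")
    if not 1 <= num_planes <= 8:
        raise ValueError("num_planes must be between 1 and 8")
    mask = np.uint8((1 << num_planes) - 1)
    return frames & mask


def amplify_intensity(noise: np.ndarray, num_planes: int = 2) -> np.ndarray:
    """Pixel-level step (assumed): stretch low-bit values towards 0..255."""
    scale = 255 // ((1 << num_planes) - 1)
    # Widen before multiplying so the product cannot wrap around in uint8.
    return (noise.astype(np.uint16) * scale).astype(np.uint8)


def upsample_region(noise: np.ndarray, top: int, left: int,
                    size: int, factor: int = 2) -> np.ndarray:
    """Region-level step (assumed): crop a square patch and enlarge it
    with nearest-neighbour repetition along the H and W axes."""
    region = noise[:, top:top + size, left:left + size]
    region = np.repeat(region, factor, axis=1)
    return np.repeat(region, factor, axis=2)


def aggregate_frames(noise: np.ndarray) -> np.ndarray:
    """Frame-level step (assumed): average the noise maps over time."""
    return noise.astype(np.float32).mean(axis=0)


if __name__ == "__main__":
    # A random stand-in for a decoded clip: 8 grayscale frames of 64x64.
    rng = np.random.default_rng(0)
    clip = rng.integers(0, 256, size=(8, 64, 64), dtype=np.uint8)

    noise = amplify_intensity(low_bit_planes(clip, 2), 2)
    patch = upsample_region(noise, top=16, left=16, size=32, factor=2)
    summary = aggregate_frames(patch)
    print(noise.shape, patch.shape, summary.shape)
```

In a full system the resulting arrays would be passed to a video classifier in place of, or alongside, the raw frames; that network is outside the scope of this sketch.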