The rapid advancement of AI-generated multimodal video-audio content has raised significant concerns regarding information security and content authenticity. Existing synthetic video datasets predominantly focus on the visual modality alone, while the few that incorporate audio are largely confined to facial deepfakes. This limitation fails to address the expanding landscape of general multimodal AI-generated content and substantially impedes the development of trustworthy detection systems. To bridge this gap, we introduce the Multimodal Video-Audio Dataset (MVAD), the first comprehensive dataset specifically designed for detecting AI-generated multimodal video-audio content. Our dataset exhibits three key characteristics: (1) genuine multimodality, with samples generated according to three realistic video-audio forgery patterns; (2) high perceptual quality, achieved through diverse state-of-the-art generative models; and (3) comprehensive diversity, spanning realistic and anime visual styles, four content categories (humans, animals, objects, and scenes), and four video-audio multimodal data types. Our dataset will be available at https://github.com/HuMengXue0104/MVAD.