As generative video models become increasingly realistic, detecting AI-generated videos requires systems that are both accurate and interpretable. However, applying Multimodal Large Language Models (MLLMs) to video forensics is currently limited by outdated datasets, simplistic evaluation protocols, and a reliance on black-box classification. To address these issues, we introduce a comprehensive dataset, benchmark, and baseline model for video forgery detection. First, we present \textbf{GenBuster-200K}, a fair dataset of over 200,000 high-quality videos sourced from state-of-the-art generators and covering diverse real-world scenarios. Second, we propose \textbf{GenBuster-Bench}, a diagnostic benchmark spanning three progressive tracks (In-Domain, Out-of-Domain, and In-the-Wild) to evaluate models under \textit{domain shifts} and \textit{generational shifts}. The benchmark also introduces an MLLM-as-a-Judge protocol to assess the quality of the generated forensic explanations. Finally, we develop \textbf{BusterX}, an MLLM baseline trained with reinforcement learning (RL). Instead of performing direct binary classification, BusterX formulates detection as a visual reasoning task in which the generated reasoning chain itself serves as the detector. Experimental results demonstrate that BusterX outperforms several leading MLLMs (e.g., Qwen3.5, Claude-Sonnet-4.6) in both detection accuracy and rationale quality.