Recent advances in generative AI have dramatically improved image and video synthesis capabilities, significantly increasing the risk of misinformation through sophisticated fake content. In response, detection methods have evolved from traditional approaches to multimodal large language models (MLLMs), offering enhanced transparency and interpretability in identifying synthetic media. However, current detection systems remain fundamentally limited by their single-modality design: they analyze either images or videos in isolation, making them ineffective against synthetic content that spans multiple media formats. To address these challenges, we introduce \textbf{BusterX++}, a framework for unified detection and explanation of synthetic images and videos, trained with a direct reinforcement learning (RL) post-training strategy. To enable comprehensive evaluation, we also present \textbf{GenBuster++}, a unified benchmark built with state-of-the-art image and video generation techniques. The benchmark comprises 4,000 images and video clips, meticulously curated by human experts to ensure high quality, diversity, and real-world applicability. Extensive experiments demonstrate the effectiveness and generalizability of our approach.