BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM

The rapid advancement of generative AI has substantially improved image and video synthesis, amplifying the risk of multimodal visual misinformation. Recent multimodal large language models (MLLMs) have shown promise for transparent AI-generated content detection through reasoning and explanation, yet existing approaches largely treat image and video forensics as isolated tasks, leaving cross-modal synergies underexplored. To address this, we present \textbf{BusterX++}, a unified MLLM for joint image and video detection with interpretable reasoning. We also introduce \textbf{GenBuster-Bench++}, a curated, difficulty-aligned benchmark containing balanced image and video samples spanning recent generation models and diverse real-world scenarios. Using this controlled setting, we revisit the widely adopted SFT $\rightarrow$ RL post-training paradigm, in which supervised fine-tuning (SFT) precedes reinforcement learning (RL). We find that a single-stage, pure RL strategy driven solely by sparse outcome rewards consistently matches or surpasses a strong SFT+RL baseline in both unified and single-modality settings. Our analysis indicates that SFT lowers policy entropy, which narrows the policy search space and limits exploration. In contrast, single-stage pure RL maintains higher policy entropy throughout training, which allows cross-modal capability transfer between image and video forensics to emerge spontaneously. Extensive experiments show that BusterX++ achieves state-of-the-art performance, highlighting the potential of RL for unified cross-modal visual reasoning.
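
For reference, the two quantities the argument rests on can be written down under standard assumptions. The notation here is illustrative and not taken from the abstract: $\pi_\theta$ denotes the policy, $x$ the input (an image or video plus its prompt), $y = (y_1, \dots, y_T)$ a sampled response, and $\mathcal{V}$ the vocabulary. A common way to measure policy entropy is the mean token-level entropy over a response:

\[
H(\pi_\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{v \in \mathcal{V}} \pi_\theta(v \mid x, y_{<t}) \, \log \pi_\theta(v \mid x, y_{<t})
\]

A sparse outcome reward, in its simplest form, scores only the final verdict and ignores the reasoning that precedes it:

\[
r(x, y) = \mathbb{1}\!\left[\hat{c}(y) = c^{\ast}(x)\right]
\]

where $\hat{c}(y)$ is the real/fake verdict parsed from the response and $c^{\ast}(x)$ is the ground-truth label. Both symbols are hypothetical, and the reward actually used may include additional terms. Under these definitions, a lower $H(\pi_\theta)$ means sampled responses are less diverse, so fewer distinct reasoning paths are ever evaluated against $r$.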
