The growing capability of video generation poses escalating security risks, making reliable detection increasingly essential. In this paper, we introduce VideoVeritas, a framework that integrates fine-grained perception and fact-based reasoning. We observe that while current multi-modal large language models (MLLMs) exhibit strong reasoning capacity, their fine-grained perception remains limited. To mitigate this, we introduce Joint Preference Alignment and Perception Pretext Reinforcement Learning (PPRL). Specifically, rather than directly optimizing for the detection task, we adopt general spatiotemporal grounding and self-supervised object counting in the RL stage, enhancing detection performance through simple perception pretext tasks. To facilitate robust evaluation, we further introduce MintVid, a lightweight yet high-quality dataset containing 3K videos from 9 state-of-the-art generators, along with a subset of real-world videos containing factual content errors. Experimental results demonstrate that existing methods tend to be biased towards either superficial reasoning or mechanical analysis, whereas VideoVeritas achieves more balanced performance across diverse benchmarks.
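The abstract does not specify how the pretext-task rewards are computed; as a minimal illustrative sketch (not the paper's actual implementation), one could imagine the RL stage scoring each rollout with a weighted combination of a temporal-grounding reward and a self-supervised counting reward. All function names, pseudo-label sources, and weights below are assumptions.

```python
# Hypothetical sketch of a pretext-task reward for the RL stage.
# All names, weights, and pseudo-label sources are illustrative assumptions,
# not the method described in the paper.

def temporal_iou(pred_span, gt_span):
    """IoU between two [start, end] intervals in seconds."""
    inter = max(0.0, min(pred_span[1], gt_span[1]) - max(pred_span[0], gt_span[0]))
    union = max(pred_span[1], gt_span[1]) - min(pred_span[0], gt_span[0])
    return inter / union if union > 0 else 0.0

def grounding_reward(pred_span, gt_span):
    """Spatiotemporal-grounding pretext: reward equals the temporal IoU
    with a pseudo ground-truth span (spatial term omitted for brevity)."""
    return temporal_iou(pred_span, gt_span)

def counting_reward(pred_count, pseudo_count):
    """Self-supervised counting pretext: reward decays with the relative
    error against a pseudo-label count obtained without human annotation."""
    if pseudo_count == 0:
        return 1.0 if pred_count == 0 else 0.0
    return max(0.0, 1.0 - abs(pred_count - pseudo_count) / pseudo_count)

def pretext_reward(sample, w_ground=0.5, w_count=0.5):
    """Scalar reward used in place of a direct real/fake detection
    objective (the weights here are assumptions)."""
    return (w_ground * grounding_reward(sample["pred_span"], sample["gt_span"])
            + w_count * counting_reward(sample["pred_count"], sample["pseudo_count"]))

# Example rollout scored by the pretext reward.
rollout = {"pred_span": (1.2, 3.8), "gt_span": (1.0, 4.0),
           "pred_count": 3, "pseudo_count": 4}
print(round(pretext_reward(rollout), 3))
```

In such a setup, the detector is rewarded for perceiving where and how many, rather than for the final real/fake label itself, which is consistent with the abstract's claim that simple perception pretext tasks indirectly strengthen detection.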