Recent breakthroughs in video generation have demonstrated an emerging capability termed Chain-of-Frames (CoF) reasoning, in which models solve complex tasks by generating a continuous sequence of frames. While these models show promise for Generative Video Reasoning (GVR), existing evaluation frameworks often rely on single-frame assessments, which can lead to outcome-hacking, where a model reaches a correct conclusion through an erroneous process. To address this, we propose a process-aware evaluation paradigm. We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Furthermore, we propose Process-outcome Consistency (POC@r), a new metric that uses a VLM-as-Judge with a hierarchical rubric to evaluate both the validity of the intermediate steps and the correctness of the final result. Our experiments reveal that state-of-the-art video models achieve a POC@1.0 of only about 20% and exhibit significant outcome-hacking. We further explore the impact of test-time scaling and sampling robustness, highlighting a substantial gap between current video generation and true generalized visual reasoning. Our benchmark is released at https://github.com/RUCAIBox/VIPER.
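For concreteness, the sketch below illustrates one plausible way a POC@r-style score could be computed. The judge interface, the rubric structure, and the interpretation of the threshold r (here assumed to be the minimum fraction of rubric steps that must be judged valid) are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch of a Process-outcome Consistency (POC@r) computation.
# Assumption: a VLM judge has already returned per-step rubric verdicts and a
# final-outcome verdict for each generated video; r is assumed to be the
# minimum fraction of rubric steps that must be valid for the process to count.

from dataclasses import dataclass
from typing import List

@dataclass
class Judgment:
    step_valid: List[bool]   # hierarchical-rubric verdicts on intermediate frames
    outcome_correct: bool    # verdict on the final frame / final answer

def poc_at_r(judgments: List[Judgment], r: float) -> float:
    """Fraction of samples whose outcome is correct AND whose process
    meets the validity threshold r (assumed semantics)."""
    consistent = 0
    for j in judgments:
        process_score = sum(j.step_valid) / max(len(j.step_valid), 1)
        if j.outcome_correct and process_score >= r:
            consistent += 1
    return consistent / max(len(judgments), 1)

def outcome_hacking_rate(judgments: List[Judgment], r: float) -> float:
    """Fraction of samples with a correct outcome but a flawed process,
    i.e., the outcome-hacking cases (same assumed semantics)."""
    hacked = sum(
        1 for j in judgments
        if j.outcome_correct
        and sum(j.step_valid) / max(len(j.step_valid), 1) < r
    )
    return hacked / max(len(judgments), 1)
```

Under this reading, POC@1.0 credits a sample only when every rubric step is judged valid and the outcome is correct, which is why it is a stricter measure than outcome-only accuracy.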