Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark

While Chain-of-Thought (CoT) prompting enables sophisticated symbolic reasoning in LLMs, it remains confined to discrete text and cannot simulate the continuous, physics-governed dynamics of the real world. Recent video generation models have emerged as potential world simulators through Chain-of-Frames (CoF) reasoning: they materialize thought as frame-by-frame visual sequences, with each frame representing a physically grounded reasoning step. Despite compelling demonstrations, a challenge persists: existing benchmarks focus on fidelity or alignment and do not assess CoF reasoning, so they cannot measure core cognitive abilities such as multi-step planning, algorithmic logic, or abstract pattern extrapolation. This evaluation gap prevents a systematic understanding of model capabilities and leaves model improvement without principled guidance. We introduce Gen-ViRe (Generative Visual Reasoning Benchmark), a framework grounded in cognitive science and real-world AI applications that decomposes CoF reasoning into six cognitive dimensions, ranging from perceptual logic to abstract planning, and 24 subtasks. Through multi-source data curation, minimal prompting protocols, and hybrid VLM-assisted evaluation with detailed criteria, Gen-ViRe delivers the first quantitative assessment of video models as reasoners. Our experiments on state-of-the-art (SOTA) systems reveal substantial discrepancies between impressive visual quality and actual reasoning depth, establishing baselines and diagnostic tools to advance genuine world simulators.
