Processes in the world, such as water pouring or ice melting, evolve whether or not they are observed. Video world models generate "worlds" via 2D frame observations. Can these generated "worlds" evolve regardless of observation? To probe this question, we design a benchmark that evaluates whether video world models can decouple state evolution from observation. Our benchmark, STEVO-Bench, applies observation control to evolving processes through instructions that insert an occluder, turn off the light, or specify camera "lookaway" trajectories. By evaluating video models with and without camera control on a diverse set of naturally occurring evolutions, we expose their limitations in decoupling state evolution from observation. STEVO-Bench proposes an evaluation protocol that automatically detects and disentangles failure modes of video world models across key aspects of natural state evolution. Analysis of STEVO-Bench results provides new insight into potential data and architecture biases of present-day video world models. Project website: https://glab-caltech.github.io/STEVOBench/. Blog: https://ziqi-ma.github.io/blog/2026/outofsight/