World models aim to understand, remember, and predict dynamic visual environments, yet a unified benchmark for evaluating these fundamental abilities remains lacking. To address this gap, we introduce MIND, the first open-domain, closed-loop revisit benchmark for evaluating Memory consIstency and action coNtrol in worlD models. MIND contains 250 high-quality videos at 1080p and 24 FPS: 100 first-person and 100 third-person clips under a shared action space, plus 25 + 25 clips spanning varied action spaces, together covering eight diverse scenes. We design an efficient evaluation framework that measures two core abilities, memory consistency and action control, capturing temporal stability and contextual coherence across viewpoints. We further vary the action space itself, including character movement speeds and camera rotation angles, to evaluate how well models generalize action control across action spaces within shared scenes. To facilitate future performance benchmarking on MIND, we introduce MIND-World, a novel interactive Video-to-World baseline. Extensive experiments demonstrate the completeness of MIND and reveal key challenges for current world models, including maintaining long-term memory consistency and generalizing across action spaces. Code: https://github.com/CSU-JPG/MIND.
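To make the closed-loop revisit idea concrete, the sketch below shows one plausible form of the memory-consistency measurement: roll a world model forward along an action sequence that eventually returns to an earlier viewpoint, then compare the revisited frame against the frame originally observed there. This is an illustrative assumption, not MIND's actual protocol; the `WorldModel` interface, the `revisit_pairs` bookkeeping, the action-dict format, and the use of PSNR as the similarity metric are all hypothetical stand-ins.

```python
import numpy as np

# Hypothetical interface: a world model that, given the current frame and an
# action, predicts the next frame. MIND's actual evaluation API may differ.
class WorldModel:
    def step(self, frame: np.ndarray, action: dict) -> np.ndarray:
        raise NotImplementedError

def psnr(a: np.ndarray, b: np.ndarray) -> float:
    # Peak signal-to-noise ratio between two uint8 frames (higher = more similar).
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

def memory_consistency(model: WorldModel, init_frame: np.ndarray,
                       actions: list, revisit_pairs: list) -> float:
    """Roll the model through `actions` and score each revisit.

    actions:       action dicts, e.g. {"move": "forward", "speed": 1.0,
                   "yaw_deg": 15.0} -- the action-space shape is assumed.
    revisit_pairs: (t_seen, t_revisit) index pairs where the trajectory
                   returns to a previously observed viewpoint.
    """
    frames, frame = [init_frame], init_frame
    for action in actions:
        frame = model.step(frame, action)
        frames.append(frame)
    # A revisited viewpoint should reproduce what the model previously rendered
    # there; averaging the per-pair similarity gives a memory-consistency score.
    scores = [psnr(frames[t_seen], frames[t_back])
              for t_seen, t_back in revisit_pairs]
    return sum(scores) / len(scores)
```

In practice a learned perceptual metric (e.g., LPIPS) would likely be preferred over raw PSNR for this comparison, since pixel-wise error penalizes small, visually harmless misalignments; PSNR is used here only to keep the sketch dependency-free.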