While recent video world models can generate highly realistic videos, their ability to perform semantic reasoning and planning remains unclear and unquantified. We introduce Target-Bench, the first benchmark that enables comprehensive evaluation of video world models' semantic reasoning, spatial estimation, and planning capabilities. Target-Bench provides 450 robot-collected scenarios spanning 47 semantic categories, with SLAM-based trajectories serving as motion tendency references. Our benchmark reconstructs motion from generated videos using a metric scale recovery mechanism, enabling the evaluation of planning performance with five complementary metrics that focus on target-approaching capability and directional consistency. Our evaluation shows that the best off-the-shelf model achieves an overall score of only 0.341, revealing a significant gap between realistic visual generation and semantic reasoning in current video world models. Furthermore, we demonstrate that fine-tuning on a relatively small real-world robot dataset can significantly improve task-level planning performance.