World models have emerged as promising neural simulators for autonomous driving, with the potential to supplement scarce real-world data and enable closed-loop evaluation. However, current research primarily evaluates these models on visual realism or downstream task performance, with limited attention to fidelity to specific action instructions, a crucial property for generating targeted simulation scenes. Although some studies address action fidelity, their evaluations rely on closed-source mechanisms, limiting reproducibility. To address this gap, we develop an open-access evaluation framework, ACT-Bench, for quantifying action fidelity, along with a baseline world model, Terra. Our benchmarking framework includes a large-scale dataset that pairs short context videos from nuScenes with corresponding future trajectory data, providing conditional input for generating future video frames and enabling evaluation of how faithfully the generated motion follows the instructed trajectory. Furthermore, Terra is trained on multiple large-scale trajectory-annotated datasets to enhance action fidelity. Leveraging this framework, we demonstrate that a state-of-the-art model does not fully adhere to given instructions, whereas Terra achieves improved action fidelity. All components of our benchmark framework will be made publicly available to support future research.