World models' comprehensive understanding of driving scenarios has significantly improved the planning accuracy of end-to-end autonomous driving frameworks. However, redundant modeling of static regions and a lack of deep interaction with trajectories keep world models from reaching their full effectiveness. In this paper, we propose the Temporal Residual World Model (TR-World), which focuses on modeling dynamic objects. By computing the temporal residuals of scene representations, it extracts dynamic-object information without relying on detection or tracking. TR-World takes only these temporal residuals as input, which lets it predict the future spatial distribution of dynamic objects more precisely. Combining this prediction with the static-object information contained in the current bird's-eye-view (BEV) features yields accurate future BEV features. Furthermore, we propose the Future-Guided Trajectory Refinement (FGTR) module, in which prior trajectories (predicted from the current scene representation) interact with the future BEV features. This module not only uses future road conditions to refine trajectories but also provides sparse spatial-temporal supervision on the future BEV features, preventing world model collapse. Comprehensive experiments on the nuScenes and NAVSIM datasets demonstrate that our method, ResWorld, achieves state-of-the-art planning performance. The code is available at https://github.com/mengtan00/ResWorld.git.
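To make the temporal-residual idea concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes single-channel BEV maps stored as nested lists, assumes the two frames are already ego-motion aligned, and assumes the predicted change is combined with the current BEV features by simple addition. The function names and the hand-written `predicted` grid (standing in for the learned world model's output) are hypothetical.

```python
from typing import List, Sequence, Tuple

# Toy single-channel BEV feature map, indexed [row][col] (an assumption for illustration).
Grid = List[List[float]]


def temporal_residual(current: Grid, previous: Grid) -> Grid:
    """Element-wise difference of two BEV maps assumed to be ego-motion aligned."""
    return [[c - p for c, p in zip(cur_row, prev_row)]
            for cur_row, prev_row in zip(current, previous)]


def compose_future_bev(current: Grid, predicted_residual: Grid) -> Grid:
    """Add a predicted dynamic change onto the current BEV map (additive fusion is assumed)."""
    return [[c + r for c, r in zip(cur_row, res_row)]
            for cur_row, res_row in zip(current, predicted_residual)]


def sample_along_trajectory(bev: Grid, waypoints: Sequence[Tuple[int, int]]) -> List[float]:
    """Read BEV values at the (row, col) cells visited by a prior trajectory."""
    return [bev[r][c] for r, c in waypoints]


# A static cell (0.5) stays put; a dynamic object moves one cell to the right.
previous = [[0.5, 0.0, 0.0],
            [1.0, 0.0, 0.0]]
current = [[0.5, 0.0, 0.0],
           [0.0, 1.0, 0.0]]

# The residual is zero on the static cell and non-zero only where the object moved.
residual = temporal_residual(current, previous)

# Hypothetical stand-in for the world model's output: the object moves right once more.
predicted = [[0.0, 0.0, 0.0],
             [0.0, -1.0, 1.0]]
future = compose_future_bev(current, predicted)

# Sparse lookup of future features along a prior trajectory (row, col waypoints).
features_on_path = sample_along_trajectory(future, [(1, 1), (1, 2)])
```

In this toy case the residual is non-zero only in the two cells the moving object left and entered, so a predictor fed with it never sees the static cell. The sparse lookup at the end mirrors, in a very reduced form, how a prior trajectory could query future BEV features at its own waypoints; the actual FGTR interaction is learned and is not reproduced here.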