Driving World Models (DWMs) have become essential for autonomous driving by enabling future scene prediction. However, existing DWMs are limited to scene generation and fail to incorporate scene understanding, which involves interpreting and reasoning about the driving environment. In this paper, we present a unified Driving World Model named HERMES, which seamlessly integrates 3D scene understanding and future scene evolution (generation) within a single framework for driving scenarios. Specifically, HERMES leverages a Bird's-Eye View (BEV) representation to consolidate multi-view spatial information while preserving geometric relationships and interactions. We also introduce world queries, which incorporate world knowledge into BEV features via causal attention in a Large Language Model (LLM), enriching the context for both understanding and generation tasks. We conduct comprehensive studies on the nuScenes and OmniDrive-nuScenes datasets to validate the effectiveness of our method. HERMES achieves state-of-the-art performance, reducing generation error by 32.4% and improving understanding metrics such as CIDEr by 8.0%. The model and code will be publicly released at https://github.com/LMD0311/HERMES.
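The world-query mechanism described above can be sketched in miniature: learnable query tokens are appended after the flattened BEV features, and causal self-attention lets each query attend to every BEV token that precedes it. This is a toy illustration in pure Python, not the actual HERMES implementation; all names, dimensions, and values are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def causal_self_attention(seq):
    """Single-head causal self-attention over a list of equal-length vectors.

    Token i attends only to tokens 0..i, so tokens placed at the end of the
    sequence (here, the world queries) can aggregate information from the
    entire preceding BEV context.
    """
    d = len(seq[0])
    out = []
    for i, q in enumerate(seq):
        # scaled dot-product scores against all visible (earlier) tokens
        scores = [sum(q[k] * seq[j][k] for k in range(d)) / math.sqrt(d)
                  for j in range(i + 1)]
        w = softmax(scores)
        # weighted sum of visible token values
        out.append([sum(w[j] * seq[j][k] for j in range(i + 1))
                    for k in range(d)])
    return out

# Toy flattened BEV features (in practice: multi-view features in BEV space)
bev_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical world queries (learnable parameters in a real model)
world_queries = [[0.5, 0.5]]

# Append queries after the BEV tokens; causal masking then lets each query
# read the full BEV context, yielding "enriched" world queries.
enriched = causal_self_attention(bev_tokens + world_queries)[len(bev_tokens):]
```

In a real DWM, the enriched queries would be decoded into future-scene predictions or understanding outputs; here they simply demonstrate the attention pattern that makes queries placed last see the whole BEV sequence.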