World models have become central to autonomous driving, where accurate scene understanding and future prediction are crucial for safe control. Recent work has explored using vision-language models (VLMs) for planning, yet existing approaches typically treat perception, prediction, and planning as separate modules. We propose UniDrive-WM, a unified VLM-based world model that jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation within a single architecture. UniDrive-WM's trajectory planner predicts a future trajectory, which conditions a VLM-based image generator to produce plausible future frames. These predictions provide additional supervisory signals that enhance scene understanding and iteratively refine trajectory generation. We further compare discrete and continuous output representations for future image prediction and analyze their influence on downstream driving performance. Experiments on the challenging Bench2Drive benchmark show that UniDrive-WM produces high-fidelity future images and improves planning performance over the previous best method by 5.9% in L2 trajectory error and 9.2% in collision rate. These results demonstrate the advantages of tightly integrating VLM-driven reasoning, planning, and generative world modeling for autonomous driving. The project page is available at https://unidrive-wm.github.io/UniDrive-WM.
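For readers unfamiliar with the planning metric cited above, the following is a minimal sketch of how an L2 trajectory error and a relative improvement are commonly computed. It assumes the L2 error is the mean Euclidean distance between predicted and ground-truth waypoints, and that the reported percentages are relative reductions against a baseline; the abstract states neither, and the function names `l2_trajectory_error` and `relative_improvement` are hypothetical, not taken from UniDrive-WM or Bench2Drive.

```python
import math


def l2_trajectory_error(predicted, ground_truth):
    """Mean Euclidean distance between matching (x, y) waypoints.

    Assumption: both trajectories are sampled at the same timestamps.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and equal in length")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


def relative_improvement(baseline, ours):
    """Fractional reduction of an error metric relative to a baseline value."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (baseline - ours) / baseline


# Illustrative waypoints only; these are not results from the paper.
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
ground_truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
error = l2_trajectory_error(predicted, ground_truth)
```

Under these assumptions, lower values of both L2 error and collision rate are better, so a positive `relative_improvement` means the new method reduced the error.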