We introduce Latent-WAM, an efficient end-to-end autonomous driving framework that achieves strong trajectory planning through spatially-aware and dynamics-informed latent world representations. Existing world-model-based planners suffer from inadequately compressed representations, limited spatial understanding, and underutilized temporal dynamics, which leads to suboptimal planning under constrained data and compute budgets. Latent-WAM addresses these limitations with two core modules: a Spatial-Aware Compressive World Encoder (SCWE), which distills geometric knowledge from a foundation model and compresses multi-view images into compact scene tokens via learnable queries, and a Dynamic Latent World Model (DLWM), which employs a causal Transformer to autoregressively predict future world states conditioned on historical visual and motion representations. Extensive experiments on NAVSIM v2 and HUGSIM demonstrate new state-of-the-art results: 89.3 EPDMS on NAVSIM v2 and 28.9 HD-Score on HUGSIM, surpassing the best prior perception-free method by 3.2 EPDMS while using significantly less training data and a compact 104M-parameter model.
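To make the two mechanisms named above concrete, the following is a minimal NumPy sketch of (a) compressing many image tokens into a few scene tokens with learnable queries and (b) causal self-attention over a history of frame tokens. It is an illustration under stated assumptions, not the authors' implementation: the function names, the token counts, the feature width, and the single-head attention without learned projections are all hypothetical simplifications, and the geometric distillation and motion conditioning described in the abstract are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def compress_with_queries(image_tokens, queries):
    """Cross-attention pooling (hypothetical simplification of SCWE).

    image_tokens: (N, D) tokens from all camera views, flattened.
    queries:      (K, D) learnable query vectors, with K much smaller than N.
    Returns (K, D) scene tokens, each a weighted average of image tokens.
    """
    d = queries.shape[-1]
    scores = queries @ image_tokens.T / np.sqrt(d)  # (K, N)
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ image_tokens                   # (K, D)


def causal_self_attention(tokens):
    """Single-head self-attention with a causal mask (hypothetical
    simplification of one DLWM layer).

    tokens: (T, D), one token per time step. Step t attends to steps <= t.
    """
    t, d = tokens.shape
    scores = tokens @ tokens.T / np.sqrt(d)             # (T, T)
    future = np.triu(np.ones((t, t), dtype=bool), k=1)  # True above diagonal
    scores = np.where(future, -np.inf, scores)          # hide future steps
    return softmax(scores, axis=-1) @ tokens            # (T, D)


# Assumed sizes for illustration only: 6 views x 64 patches, width 32,
# 16 scene tokens, and a 4-frame history.
rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(6 * 64, 32))
queries = rng.normal(size=(16, 32))
scene_tokens = compress_with_queries(image_tokens, queries)

history = rng.normal(size=(4, 32))
context = causal_self_attention(history)
```

The causal mask is what permits autoregressive prediction: the representation at step t is unaffected by later frames, so a model of this form can be trained to predict step t+1 from steps up to t and then rolled forward on its own outputs.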