World models that forecast environmental changes from actions are vital for autonomous driving models with strong generalization. Prevailing driving world models are built mainly on video prediction models. Although these models can produce high-fidelity video sequences with advanced diffusion-based generators, they are constrained in prediction horizon and overall generalization. In this paper, we explore solving this problem by combining a generative loss with MAE-style feature-level context learning. In particular, we instantiate this goal with three key designs: (1) a more scalable Diffusion Transformer (DiT) structure trained with an additional mask reconstruction task; (2) diffusion-related mask tokens that handle the fuzzy relation between mask reconstruction and the generative diffusion process; and (3) an extension of the mask reconstruction task to the spatial-temporal domain, using row-wise masks with shifted self-attention rather than the masked self-attention of MAE. We then adopt a row-wise cross-view module to align with this mask design. Based on these improvements, we propose MaskGWM: a Generalizable driving World Model embodied with video Mask reconstruction. Our model has two variants: MaskGWM-long, focusing on long-horizon prediction, and MaskGWM-mview, dedicated to multi-view generation. Comprehensive experiments on standard benchmarks validate the effectiveness of the proposed method, including standard validation on the nuScenes dataset, long-horizon rollouts on the OpenDV-2K dataset, and zero-shot evaluation on the Waymo dataset. Quantitative metrics on these datasets show that our method notably improves on state-of-the-art driving world models.
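The row-wise mask in design (3) can be sketched as follows. This is a minimal illustration of the idea of masking whole rows of patch tokens per frame (so that attention can operate over contiguous row spans) rather than scattering masked tokens randomly as in MAE; the shapes, mask ratio, and function name are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def row_wise_mask(num_frames: int, num_rows: int, num_cols: int,
                  mask_ratio: float = 0.5, seed: int = 0) -> np.ndarray:
    """Illustrative row-wise mask (an assumption, not the paper's code):
    drop entire rows of patch tokens per frame, keeping each row either
    fully visible or fully masked. Returns a boolean array of shape
    (num_frames, num_rows, num_cols); True marks a masked token."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((num_frames, num_rows, num_cols), dtype=bool)
    n_masked = int(round(num_rows * mask_ratio))
    for t in range(num_frames):
        rows = rng.choice(num_rows, size=n_masked, replace=False)
        mask[t, rows, :] = True  # mask whole rows, preserving row structure
    return mask

m = row_wise_mask(num_frames=2, num_rows=8, num_cols=8, mask_ratio=0.25)
print(m.shape)   # (2, 8, 8)
print(m.mean())  # overall masked fraction equals the row ratio, 0.25
```

Because every row is either fully kept or fully dropped, the visible tokens remain contiguous along the row axis, which is what makes a shifted (row-wise) self-attention pattern applicable instead of MAE-style attention over scattered visible tokens.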