End-to-end autonomous driving increasingly leverages self-supervised video pretraining to learn transferable planning representations. However, pretraining video world models for scene understanding has so far brought only limited improvements. This limitation is compounded by the inherent ambiguity of driving: each scene typically provides only a single human trajectory, making it difficult to learn multimodal behaviors. In this work, we propose Drive-JEPA, a framework that integrates Video Joint-Embedding Predictive Architecture (V-JEPA) with multimodal trajectory distillation for end-to-end driving. First, we adapt V-JEPA for end-to-end driving, pretraining a ViT encoder on large-scale driving videos to produce predictive representations aligned with trajectory planning. Second, we introduce a proposal-centric planner that distills diverse simulator-generated trajectories alongside human trajectories, with a momentum-aware selection mechanism to promote stable and safe behavior. When evaluated on NAVSIM, the V-JEPA representation combined with a simple transformer-based decoder outperforms prior methods by 3 PDMS in the perception-free setting. The complete Drive-JEPA framework achieves 93.3 PDMS on v1 and 87.8 EPDMS on v2, setting a new state-of-the-art.