Autonomous driving, as an agent operating in the physical world, requires the fundamental capability to build \textit{world models} that capture how the environment evolves spatiotemporally in order to support long-term planning. At the same time, scalability demands that such models be learned in a self-supervised manner; the \textit{joint-embedding predictive architecture (JEPA)} enables world-model learning by leveraging large volumes of unlabeled data, without relying on expensive human annotations. In this paper, we propose \textbf{AD-LiST-JEPA}, a self-supervised world model for autonomous driving that predicts future spatiotemporal evolution from LiDAR data within a JEPA framework. We evaluate the quality of the learned representations on a downstream LiDAR-based occupancy completion and forecasting (OCF) task, which jointly assesses perception and prediction. Proof-of-concept experiments show that an encoder pretrained via JEPA-based world-model learning yields improved OCF performance.
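As an illustration of the learning principle referenced above (a generic JEPA-style objective, not necessarily this paper's exact formulation), the world model is typically trained to predict, in latent space, the target encoder's embedding of a future observation from the context embedding of the current one:
\begin{equation}
\mathcal{L}_{\text{JEPA}} = \left\| P_{\phi}\!\left(E_{\theta}(x_t),\, \Delta t\right) - \operatorname{sg}\!\left(\bar{E}_{\bar{\theta}}(x_{t+\Delta t})\right) \right\|_2^2,
\end{equation}
where $x_t$ is the observation at time $t$, $E_{\theta}$ is the context encoder, $P_{\phi}$ is the predictor conditioned on the time offset $\Delta t$, $\bar{E}_{\bar{\theta}}$ is a target encoder whose weights are an exponential moving average of $\theta$, and $\operatorname{sg}(\cdot)$ denotes stop-gradient. All symbols here are assumptions drawn from standard JEPA formulations; predicting in embedding space rather than reconstructing raw LiDAR returns is what allows the objective to ignore unpredictable low-level detail while modeling spatiotemporal evolution.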