Perceiving the world and forecasting its future state is a critical task for self-driving. Supervised approaches leverage annotated object labels to learn a model of the world -- traditionally with object detections and trajectory predictions, or temporal bird's-eye-view (BEV) occupancy fields. However, these annotations are expensive and typically limited to a set of predefined categories that do not cover everything we might encounter on the road. Instead, we learn to perceive and forecast a continuous 4D (spatio-temporal) occupancy field with self-supervision from LiDAR data. This unsupervised world model can be easily and effectively transferred to downstream tasks. We tackle point cloud forecasting by adding a lightweight learned renderer and achieve state-of-the-art performance on Argoverse 2, nuScenes, and KITTI. To further showcase its transferability, we fine-tune our model for BEV semantic occupancy forecasting and show that it outperforms the fully supervised state-of-the-art, especially when labeled data is scarce. Finally, when compared to the prior state-of-the-art in spatio-temporal geometric occupancy prediction, our 4D world model achieves a much higher recall of objects from classes relevant to self-driving.