Learning unsupervised world models for autonomous driving has the potential to dramatically improve the reasoning capabilities of today's systems. However, most work neglects the physical attributes of the world and focuses on sensor data alone. To address this challenge, we propose MUVO, a MUltimodal World Model with Geometric VOxel Representations. We use raw camera and lidar data to learn a sensor-agnostic geometric representation of the world that can be used directly by downstream tasks such as planning. We demonstrate multimodal future predictions and show that our geometric representation improves the prediction quality of both camera images and lidar point clouds.