Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications

Understanding how the surrounding environment changes is crucial for performing downstream tasks safely and reliably in autonomous driving applications. Recent occupancy estimation techniques that use only camera images as input can provide dense occupancy representations of large-scale scenes based on the current observation. However, they are mostly limited to representing the current 3D space and do not consider the future states of surrounding objects along the time axis. To extend camera-only occupancy estimation into spatiotemporal prediction, we propose Cam4DOcc, a new benchmark for camera-only 4D occupancy forecasting that evaluates how the surrounding scene changes in the near future. We build our benchmark on multiple publicly available datasets, including nuScenes, nuScenes-Occupancy, and Lyft-Level5; it provides sequential occupancy states of general movable and static objects, as well as their 3D backward centripetal flow. To establish this benchmark for future research with comprehensive comparisons, we introduce four baseline types drawn from diverse camera-based perception and prediction implementations: a static-world occupancy model, voxelization of point cloud prediction, 2D-3D instance-based prediction, and our proposed end-to-end 4D occupancy forecasting network. Furthermore, we provide a standardized evaluation protocol for multiple preset tasks to compare the performance of all the proposed baselines on present and future occupancy estimation with respect to objects of interest in autonomous driving scenarios. The dataset and our implementation of all four baselines in the proposed Cam4DOcc benchmark will be released here: https://github.com/haomo-ai/Cam4DOcc.
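To make the idea of a static-world occupancy baseline concrete, the following minimal Python sketch copies the present occupancy unchanged into every future frame and scores it against a toy ground truth. It is an illustration under stated assumptions, not the released implementation: the sparse voxel-set representation, the function names (`static_world_forecast`, `iou`, `mean_future_iou`), and the use of intersection over union as the score are all hypothetical choices made here, since the abstract does not specify the metric or the data layout.

```python
from typing import FrozenSet, List, Tuple

# Assumed representation: an occupancy frame is the set of occupied voxel indices.
Voxel = Tuple[int, int, int]
Occupancy = FrozenSet[Voxel]


def static_world_forecast(present: Occupancy, num_future: int) -> List[Occupancy]:
    """Predict every future frame as an unchanged copy of the present frame."""
    return [present for _ in range(num_future)]


def iou(pred: Occupancy, truth: Occupancy) -> float:
    """Intersection over union of two occupied-voxel sets (1.0 if both are empty)."""
    union = pred | truth
    if not union:
        return 1.0
    return len(pred & truth) / len(union)


def mean_future_iou(preds: List[Occupancy], truths: List[Occupancy]) -> float:
    """Average the per-frame IoU over all future frames."""
    if not preds or len(preds) != len(truths):
        raise ValueError("preds and truths must be non-empty and equally long")
    return sum(iou(p, t) for p, t in zip(preds, truths)) / len(preds)


# Toy scene (hypothetical data): a two-voxel object moves one voxel along x per frame.
present = frozenset({(0, 0, 0), (1, 0, 0)})
future_truth = [
    frozenset({(1, 0, 0), (2, 0, 0)}),
    frozenset({(2, 0, 0), (3, 0, 0)}),
]
future_pred = static_world_forecast(present, num_future=len(future_truth))
per_frame = [iou(p, t) for p, t in zip(future_pred, future_truth)]
score = mean_future_iou(future_pred, future_truth)
```

In this toy scene the per-frame overlap drops from one third to zero as the object moves away from its present position, which is the kind of degradation a forecasting model would be expected to reduce for movable objects.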
