Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications

Understanding how the surrounding environment changes is crucial for performing downstream tasks safely and reliably in autonomous driving. Recent camera-only occupancy estimation techniques can provide dense occupancy representations of large-scale scenes from the current observation. However, they are mostly limited to representing the current 3D space and do not consider the future state of surrounding objects along the time axis. To extend camera-only occupancy estimation into spatiotemporal prediction, we propose Cam4DOcc, a new benchmark for camera-only 4D occupancy forecasting that evaluates how the surrounding scene changes in the near future. We build the benchmark on multiple publicly available datasets, including nuScenes, nuScenes-Occupancy, and Lyft-Level5. It provides sequential occupancy states of general movable and static objects, as well as their 3D backward centripetal flow. To establish the benchmark for future research with comprehensive comparisons, we introduce four baseline types drawn from diverse camera-based perception and prediction implementations: a static-world occupancy model, voxelization of point cloud prediction, 2D-3D instance-based prediction, and our proposed end-to-end 4D occupancy forecasting network. Furthermore, we provide a standardized evaluation protocol with multiple preset tasks to compare the performance of all the proposed baselines on present and future occupancy estimation for objects of interest in autonomous driving scenarios. The dataset and our implementation of all four baselines in the proposed Cam4DOcc benchmark will be released here: https://github.com/haomo-ai/Cam4DOcc.
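
To make the simplest baseline idea concrete, the following minimal sketch illustrates a static-world forecast scored against future occupancy. It is not taken from the Cam4DOcc code: the function names (`voxelize`, `static_world_forecast`, `iou`), the set-of-voxel-indices representation, the toy data, and the use of intersection over union as the score are all assumptions made for illustration, and the benchmark's actual protocol may differ.

```python
import math

def voxelize(points, origin, voxel_size):
    """Map 3D points (metres) to a set of occupied integer voxel indices."""
    return {
        tuple(math.floor((p - o) / voxel_size) for p, o in zip(point, origin))
        for point in points
    }

def static_world_forecast(current_occupancy, num_future_frames):
    """Hypothetical static-world baseline: repeat the current occupancy
    unchanged for every future frame."""
    return [set(current_occupancy) for _ in range(num_future_frames)]

def iou(predicted, ground_truth):
    """Intersection over union of two sets of occupied voxels."""
    union = predicted | ground_truth
    if not union:
        return 1.0  # both empty: treated as perfect agreement (a convention chosen here)
    return len(predicted & ground_truth) / len(union)

# Toy data (assumed): an object two voxels long that moves one voxel along x per frame.
current = voxelize(
    [(0.1, 0.1, 0.1), (0.3, 0.1, 0.1)],
    origin=(0.0, 0.0, 0.0),
    voxel_size=0.2,
)
future_gt = [{(x + t, y, z) for (x, y, z) in current} for t in (1, 2)]

forecast = static_world_forecast(current, num_future_frames=2)
scores = [iou(pred, gt) for pred, gt in zip(forecast, future_gt)]
print(scores)
```

In this toy case the score drops as the object moves away from its initial position, which is the kind of gap a forecasting model is meant to close relative to a static-world assumption.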
