3D occupancy prediction (Occ) is a rapidly emerging and challenging perception task in autonomous driving that represents the driving scene as a uniformly partitioned 3D voxel grid with semantic labels. Compared with 3D object detection, grid-based perception is better at recognizing general objects that are irregularly shaped, of unknown category, or partially occluded. However, existing 3D occupancy networks (occnets) are both computationally heavy and label-hungry. In terms of model complexity, occnets are commonly built from heavy Conv3D modules or voxel-level transformers. In terms of annotation requirements, occnets are supervised with large-scale, expensive dense voxel labels. This model and data inefficiency, caused by excessive network parameters and annotation requirements, severely hinders the onboard deployment of occnets. This paper proposes an efficient 3D occupancy network (EFFOcc) that targets minimal network complexity and label requirements while achieving state-of-the-art accuracy. EFFOcc uses only simple 2D operators and achieves state-of-the-art Occ accuracy on multiple large-scale benchmarks: Occ3D-nuScenes, Occ3D-Waymo, and OpenOccupancy-nuScenes. On the Occ3D-nuScenes benchmark, EFFOcc has only 18.4M parameters and achieves a mean IoU (mIoU) of 50.46; to our knowledge, it has the fewest parameters among related occnets. Moreover, we propose a two-stage active learning strategy to reduce the requirement for labelled data. Active EFFOcc trained with 6% labelled voxels achieves 47.19 mIoU, which is 95.7% of the fully supervised performance. EFFOcc also supports improved vision-only occupancy prediction with the aid of region-decomposed distillation. Code and demo videos will be available at https://github.com/synsin0/EFFOcc.
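For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how mean IoU can be computed over a semantic voxel grid. It is not taken from the EFFOcc code base: the function name `mean_iou`, the flattened-list input, and the `ignore_label` value of 255 are assumptions made for illustration, and benchmark-specific conventions (for example, any additional masking of voxels) are not modelled.

```python
from typing import Sequence


def mean_iou(
    pred: Sequence[int],
    target: Sequence[int],
    num_classes: int,
    ignore_label: int = 255,  # assumed marker for unlabelled voxels
) -> float:
    """Mean IoU over semantic classes for flattened voxel labels.

    Assumes every non-ignored target label lies in [0, num_classes).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of voxels")

    intersection = [0] * num_classes
    union = [0] * num_classes

    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # skip voxels without a valid ground-truth label
        if p == t:
            # Correct voxel: counts once in both intersection and union.
            intersection[t] += 1
            union[t] += 1
        else:
            # Wrong voxel: enlarges the union of the true class and,
            # if valid, of the predicted class.
            union[t] += 1
            if 0 <= p < num_classes:
                union[p] += 1

    # Average only over classes that occur in the prediction or the target.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Toy example: six voxels, three classes.
    print(mean_iou([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3))
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. Reported benchmark scores such as 50.46 mIoU are this quantity expressed as a percentage.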