In autonomous driving, 3D occupancy prediction outputs voxel-wise occupancy status and semantic labels, giving a more comprehensive understanding of 3D scenes than traditional perception tasks such as 3D object detection and bird's-eye view (BEV) semantic segmentation. Recent work has extensively explored various aspects of this task, including view transformation techniques, ground-truth label generation, and elaborate network design, with the aim of achieving superior performance. However, inference speed, which is crucial for running on an autonomous vehicle, has been neglected. To this end, a new method, dubbed FastOcc, is proposed. By carefully analyzing how four components (the input image resolution, the image backbone, the view transformation, and the occupancy prediction head) affect network performance and latency, it is found that the occupancy prediction head holds considerable potential for accelerating the model while preserving its accuracy. To improve this component, the time-consuming 3D convolution network is replaced with a novel residual-like architecture, in which features are mainly processed by a lightweight 2D BEV convolution network and compensated by integrating 3D voxel features interpolated from the original image features. Experiments on the Occ3D-nuScenes benchmark demonstrate that FastOcc achieves state-of-the-art results with fast inference speed.
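The data flow of the residual-like occupancy head can be illustrated with a minimal, single-channel sketch in pure Python. This is not the FastOcc implementation: the function names, the fixed 3x3 kernel standing in for the learned 2D BEV network, the broadcast of the BEV output along the height axis, and the `project` callback that maps a voxel index to image-feature coordinates are all assumptions made for illustration. A real model would use a tensor library, many channels, and calibrated camera projection.

```python
from math import floor


def conv2d_3x3(bev, kernel):
    """Zero-padded 3x3 cross-correlation over a 2D BEV map (stand-in for the 2D BEV network)."""
    h, w = len(bev), len(bev[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy + 1][dx + 1] * bev[yy][xx]
            out[y][x] = acc
    return out


def bilinear_sample(img, u, v):
    """Bilinearly interpolate img at column u, row v, clamping to the border."""
    h, w = len(img), len(img[0])
    u = min(max(u, 0.0), w - 1.0)
    v = min(max(v, 0.0), h - 1.0)
    u0, v0 = int(floor(u)), int(floor(v))
    u1, v1 = min(u0 + 1, w - 1), min(v0 + 1, h - 1)
    du, dv = u - u0, v - v0
    top = img[v0][u0] * (1 - du) + img[v0][u1] * du
    bottom = img[v1][u0] * (1 - du) + img[v1][u1] * du
    return top * (1 - dv) + bottom * dv


def residual_occupancy_head(bev, image_feat, kernel, depth, project):
    """Fuse a 2D BEV branch with interpolated 3D voxel features.

    bev        : H x W BEV feature map (single channel, for illustration)
    image_feat : 2D image feature map sampled by the voxel branch
    kernel     : 3x3 weights of the hypothetical 2D BEV convolution
    depth      : number of voxels along the height axis
    project    : hypothetical callback (x, y, z) -> (u, v) in image_feat
    Returns a depth x H x W voxel feature volume.
    """
    bev_out = conv2d_3x3(bev, kernel)  # main path: cheap 2D processing
    h, w = len(bev), len(bev[0])
    fused = []
    for z in range(depth):
        plane = []
        for y in range(h):
            row = []
            for x in range(w):
                u, v = project(x, y, z)
                # Residual-like compensation: add the interpolated voxel feature
                # to the BEV feature shared by every height level of this column.
                row.append(bev_out[y][x] + bilinear_sample(image_feat, u, v))
            plane.append(row)
        fused.append(plane)
    return fused


# Toy usage: identity kernel, constant image features, made-up projection.
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
volume = residual_occupancy_head(
    bev=[[1.0, 2.0], [3.0, 4.0]],
    image_feat=[[10.0, 10.0], [10.0, 10.0]],
    kernel=identity,
    depth=3,
    project=lambda x, y, z: (x, z),
)
```

The point of the design is that the expensive work happens on an H x W grid rather than a D x H x W volume. The voxel branch needs only one interpolation per voxel and restores the height-dependent detail that the BEV representation collapses.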