In autonomous vehicles, understanding the 3D environment surrounding the ego vehicle in real time is essential. 3D semantic occupancy maps offer a compact scene representation that encodes both geometric distances and semantic object information. State-of-the-art 3D mapping methods use transformers with cross-attention mechanisms to lift 2D vision-centric camera features into the 3D domain. However, these methods face significant challenges in real-time applications because of their high computational demands during inference. This limitation is particularly problematic in autonomous vehicles, where GPU resources must be shared with other tasks such as localization and planning. In this paper, we introduce an approach that extracts features from front-view 2D camera images and LiDAR scans, then employs a sparse convolution network (Minkowski Engine) for 3D semantic occupancy prediction. Because outdoor scenes in autonomous driving scenarios are inherently sparse, sparse convolution is a natural fit. By jointly solving the problems of 3D scene completion of sparse scenes and 3D semantic segmentation, we provide a more efficient learning framework suitable for real-time applications in autonomous vehicles. We also demonstrate competitive accuracy on the nuScenes dataset.