In autonomous vehicles, understanding the 3D environment around the ego vehicle in real time is essential. 3D semantic occupancy maps offer a compact scene representation that encodes both geometric distances and semantic object information. State-of-the-art 3D mapping methods use transformers with cross-attention mechanisms to lift 2D vision-centric camera features into the 3D domain. However, their high computational cost during inference makes real-time deployment difficult. This limitation is particularly problematic in autonomous vehicles, where GPU resources must be shared with other tasks such as localization and planning. In this paper, we introduce an approach that extracts features from front-view 2D camera images and LiDAR scans, then employs a sparse convolution network (Minkowski Engine) for 3D semantic occupancy prediction. Because outdoor scenes in autonomous driving are inherently sparse, sparse convolution is a natural fit. By jointly solving 3D scene completion of sparse scenes and 3D semantic segmentation, we provide a more efficient learning framework suitable for real-time applications in autonomous vehicles. We also demonstrate competitive accuracy on the nuScenes dataset.
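To illustrate why sparsity matters for this kind of representation, the following is a minimal sketch of a sparse semantic occupancy grid that stores only occupied voxels. It is not the method described above and does not use the Minkowski Engine API. The function names `voxelize` and `occupancy_ratio`, the toy points and labels, the 0.5 m voxel size, and the majority-vote labeling rule are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import floor


def voxelize(points, labels, voxel_size):
    """Map labeled 3D points to a sparse grid {voxel index: majority label}.

    Only voxels that contain at least one point are stored, so memory
    scales with the number of occupied voxels, not with the grid volume.
    """
    votes = defaultdict(Counter)
    for (x, y, z), label in zip(points, labels):
        key = (
            floor(x / voxel_size),
            floor(y / voxel_size),
            floor(z / voxel_size),
        )
        votes[key][label] += 1
    # Assumed labeling rule: each voxel takes its most frequent point label.
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}


def occupancy_ratio(sparse_grid, grid_shape):
    """Fraction of the dense grid's voxels that are actually occupied."""
    nx, ny, nz = grid_shape
    return len(sparse_grid) / (nx * ny * nz)


# Hypothetical toy scene: five labeled points in a 10 m x 10 m x 2 m volume.
points = [
    (0.1, 0.2, 0.0),
    (0.3, 0.1, 0.1),
    (0.4, 0.4, 0.2),
    (5.2, 1.1, 0.3),
    (9.9, 9.9, 1.9),
]
labels = ["road", "road", "car", "car", "vegetation"]

grid = voxelize(points, labels, voxel_size=0.5)
ratio = occupancy_ratio(grid, grid_shape=(20, 20, 4))
print(grid)
print(f"occupied fraction: {ratio:.6f}")
```

In this toy scene only 3 of the 1,600 voxels in the dense grid are occupied. A sparse convolution operates on coordinate lists of this kind, which is where the potential savings over dense 3D processing come from.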