Recent vision-only perception models for autonomous driving have achieved promising results by encoding multi-view image features into Bird's-Eye-View (BEV) space. A critical step, and the main bottleneck, of these methods is transforming image features into the BEV coordinate frame. This paper focuses on leveraging geometry information, such as depth, to model this feature transformation. Existing works either rely on non-parametric depth distribution modeling, which leads to significant memory consumption, or ignore the geometry information altogether to sidestep this problem. In contrast, we propose to use parametric depth distribution modeling for feature transformation. We first lift the 2D image features to the 3D space defined for the ego vehicle via a predicted parametric depth distribution for each pixel in each view. Then, we aggregate the 3D feature volume into the BEV frame based on the 3D space occupancy derived from depth. Finally, we use the transformed features for downstream tasks such as object detection and semantic segmentation. Existing semantic segmentation methods also suffer from a hallucination problem because they do not take visibility information into account. This hallucination can be particularly problematic for subsequent modules such as control and planning. To mitigate the issue, our method provides depth uncertainty and reliable visibility-aware estimations. We further leverage our parametric depth modeling to present a novel visibility-aware evaluation metric that, when taken into account, can mitigate the hallucination problem. Extensive experiments on object detection and semantic segmentation on the nuScenes dataset demonstrate that our method outperforms existing methods on both tasks.
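The lifting step described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the abstract does not name the distribution family, so a Laplace distribution with per-pixel location `mu` and scale `b` is assumed here, and the function names, the depth bins, and the definition of visibility as the probability that the surface lies beyond a bin's near edge are all hypothetical choices made for illustration. The aggregation into the BEV frame, which requires camera calibration, is omitted.

```python
import numpy as np


def laplace_cdf(d, mu, b):
    """CDF of a Laplace distribution (assumed family) with location mu and scale b."""
    z = (d - mu) / b
    tail = 0.5 * np.exp(-np.abs(z))  # numerically safe for large |z|
    return np.where(z < 0, tail, 1.0 - tail)


def lift_with_parametric_depth(feats, mu, b, bin_edges):
    """Lift per-pixel features along camera rays (hypothetical helper).

    feats:     (H, W, C) image features
    mu, b:     (H, W) predicted depth location and scale per pixel
    bin_edges: (D + 1,) depth bin boundaries in metres
    Returns the frustum volume (H, W, D, C), occupancy (H, W, D)
    and visibility (H, W, D).
    """
    # Evaluate the CDF at every bin edge for every pixel: (H, W, D + 1).
    cdf = laplace_cdf(bin_edges[None, None, :], mu[..., None], b[..., None])
    # Occupancy: probability mass that the surface falls inside each bin.
    occupancy = cdf[..., 1:] - cdf[..., :-1]
    # Visibility (assumed definition): probability that the surface lies
    # beyond the near edge of the bin, so the ray is not yet blocked.
    visibility = 1.0 - cdf[..., :-1]
    # Weight each pixel's feature vector by the occupancy of each depth bin.
    volume = occupancy[..., None] * feats[:, :, None, :]
    return volume, occupancy, visibility


# Toy example: a 2x3 feature map with 4 channels and 10 depth bins over 0-20 m.
feats = np.ones((2, 3, 4))
mu = np.full((2, 3), 9.0)
b = np.full((2, 3), 1.0)
bin_edges = np.linspace(0.0, 20.0, 11)
volume, occupancy, visibility = lift_with_parametric_depth(feats, mu, b, bin_edges)
```

Under this assumption the network only has to predict two numbers per pixel (`mu` and `b`) rather than one score per depth bin, and the same two numbers yield both an uncertainty estimate (`b`) and a visibility value for any depth along the ray.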