Vision-based perception for autonomous driving requires explicit modeling of a 3D space, into which 2D latent representations are mapped and on which subsequent 3D operators are applied. However, operating on dense latent spaces introduces cubic time and space complexity, which limits scalability in terms of perception range and spatial resolution. Existing approaches compress the dense representation using projections such as Bird's Eye View (BEV) or Tri-Perspective View (TPV). Although efficient, these projections result in information loss, especially for tasks like semantic occupancy prediction. To address this, we propose SparseOcc, an efficient occupancy network inspired by sparse point cloud processing. It utilizes a lossless sparse latent representation with three key innovations. Firstly, a 3D sparse diffuser performs latent completion using spatially decomposed 3D sparse convolutional kernels. Secondly, a feature pyramid with sparse interpolation enhances each scale with information from the others. Finally, the transformer head is redesigned as a sparse variant. SparseOcc achieves a remarkable 74.9% reduction in FLOPs over the dense baseline. Interestingly, it also improves accuracy, from 12.8% to 14.1% mIoU, which can in part be attributed to the sparse representation's ability to avoid hallucinations on empty voxels.
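To make the decomposition idea concrete, the sketch below illustrates how a 3D kernel can be factored into three 1D passes applied only at occupied voxels, diffusing features into neighboring empty sites (the "latent completion" effect). This is a minimal pure-Python illustration of the general technique, not the paper's implementation; the voxel-dict layout, `sparse_conv_1d` helper, and scalar features are assumptions for clarity, and a real system would use a sparse convolution library with learned, vector-valued kernels.

```python
def sparse_conv_1d(voxels, axis, weights):
    """Apply a 1D convolution along one axis of a sparse voxel grid.

    voxels:  dict mapping (x, y, z) -> scalar feature (only occupied sites).
    axis:    0, 1, or 2 (x, y, or z).
    weights: dict mapping integer offset -> kernel weight, e.g. a 3-tap kernel.
    Scatter form: each occupied voxel writes into its neighbors, so features
    also spread into previously empty sites (latent completion).
    """
    out = {}
    for coord, feat in voxels.items():
        for off, w in weights.items():
            nb = list(coord)
            nb[axis] += off
            out[tuple(nb)] = out.get(tuple(nb), 0.0) + w * feat
    return out


def decomposed_diffuse(voxels, weights):
    """Emulate a k*k*k kernel as three sequential 1D passes (x, then y, then z).

    Cost per active voxel is 3k taps instead of k**3, which is the point of
    spatially decomposing the 3D kernel.
    """
    for axis in range(3):
        voxels = sparse_conv_1d(voxels, axis, weights)
    return voxels


# Hypothetical toy input: a single occupied voxel with a 3-tap averaging kernel.
kernel = {-1: 0.25, 0: 0.5, 1: 0.25}
diffused = decomposed_diffuse({(0, 0, 0): 1.0}, kernel)
print(len(diffused))          # 27 voxels: the 3x3x3 neighborhood is now filled
print(diffused[(0, 0, 0)])    # 0.125 = 0.5**3, the separable center weight
```

Because the kernel is separable, the three 1D passes reproduce what a full 3x3x3 kernel would compute, while each pass only touches occupied coordinates and their immediate neighbors.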