Although multiview fusion has demonstrated potential in LiDAR segmentation, its dependence on computationally intensive point-based interactions, arising from the lack of fixed correspondences between views such as range view and Bird's-Eye View (BEV), hinders its practical deployment. This paper challenges the prevailing notion that multiview fusion is essential for achieving high performance. We demonstrate that significant gains can be realized by directly fusing Polar and Cartesian partitioning strategies within the BEV space. Our proposed BEV-only segmentation model leverages the inherent fixed grid correspondences between these partitioning schemes, enabling a fusion process that is orders of magnitude faster (170$\times$ speedup) than conventional point-based methods. Furthermore, our approach facilitates dense feature fusion, preserving richer contextual information compared to sparse point-based alternatives. To enhance scene understanding while maintaining inference efficiency, we also introduce a hybrid Transformer-CNN architecture. Extensive evaluation on the SemanticKITTI and nuScenes datasets provides compelling evidence that our method outperforms previous multiview fusion approaches in terms of both performance and inference speed, highlighting the potential of BEV-based fusion for LiDAR segmentation. Code is available at \url{https://github.com/skyshoumeng/PC-BEV}.
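The key efficiency claim rests on the fixed correspondence between Polar and Cartesian BEV grids: each Cartesian cell maps to a polar cell via a radius/angle computation that can be precomputed once, so fusion reduces to a single dense gather rather than per-point scatter/gather. The following is a minimal NumPy sketch of that idea, not the paper's implementation; the grid sizes, sensor range, and fusion-by-addition are all illustrative assumptions.

```python
import numpy as np

# Illustrative grid sizes and sensor range (assumptions, not from the paper).
H = W = 256          # Cartesian BEV grid (H x W)
R, A = 256, 256      # polar BEV grid: radial bins x angular bins
max_range = 50.0     # metres covered by both grids

# Precompute, once, which polar cell each Cartesian cell falls into.
xs = (np.arange(W) + 0.5) / W * 2 * max_range - max_range
ys = (np.arange(H) + 0.5) / H * 2 * max_range - max_range
gx, gy = np.meshgrid(xs, ys)                      # (H, W) cell centres
r = np.sqrt(gx**2 + gy**2)
theta = np.arctan2(gy, gx)                        # in (-pi, pi]
r_idx = np.clip((r / max_range * R).astype(int), 0, R - 1)
a_idx = np.clip(((theta + np.pi) / (2 * np.pi) * A).astype(int), 0, A - 1)

def fuse(cart_feat, polar_feat):
    """Fuse dense polar features into the Cartesian grid via one gather.

    cart_feat:  (H, W, C) Cartesian BEV features
    polar_feat: (R, A, C) polar BEV features
    """
    # Additive fusion is a stand-in; any elementwise merge works here.
    return cart_feat + polar_feat[r_idx, a_idx]

C = 32
fused = fuse(np.zeros((H, W, C)), np.random.randn(R, A, C))
```

Because `r_idx` and `a_idx` depend only on the grid geometry, they are computed once and reused for every frame, which is where the large speedup over per-point correspondence search comes from.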