VPFusion: Towards Robust Vertical Representation Learning for 3D Object Detection

Efficient point cloud representation is a fundamental element of LiDAR-based 3D object detection. Recent grid-based detectors usually divide point clouds into voxels or pillars and construct single-stream networks in Bird's Eye View. However, these point cloud encoding paradigms underestimate the point representation in the vertical direction, which causes the loss of semantic or fine-grained information, especially for vertically sensitive objects such as pedestrians and cyclists. In this paper, we propose an explicit vertical multi-scale representation learning framework, VPFusion, to combine the complementary information from both voxel and pillar streams. Specifically, VPFusion builds upon a sparse voxel-pillar-based backbone. The backbone divides point clouds into voxels and pillars, then encodes features with 3D and 2D sparse convolution simultaneously. Next, we introduce the Sparse Fusion Layer (SFL), which establishes a bidirectional pathway between sparse voxel and pillar features so that the two can interact. Additionally, we present the Dense Fusion Neck (DFN) to effectively combine the dense feature maps from the voxel and pillar branches at multiple scales. Extensive experiments on the large-scale Waymo Open Dataset and nuScenes Dataset demonstrate that VPFusion surpasses the single-stream baselines by a large margin and achieves state-of-the-art performance with real-time inference speed.
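To make the voxel/pillar distinction in the abstract concrete, the following is a minimal NumPy sketch of how the same points map to 3D voxel cells and to 2D pillar cells. It is an illustration only, not the VPFusion backbone: the function name `grid_indices`, the 0.5 m cell size, and the sample points are all assumptions chosen for the example.

```python
import numpy as np

def grid_indices(points, cell_size, origin):
    """Return the occupied voxel cells (ix, iy, iz) and pillar cells (ix, iy).

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Integer cell coordinates of every point along x, y, z.
    coords = np.floor((points[:, :3] - origin) / cell_size).astype(np.int64)
    # Voxels keep the vertical index, so points stacked in z stay separate.
    voxels = np.unique(coords, axis=0)
    # Pillars drop the vertical index, so a whole column becomes one cell.
    pillars = np.unique(coords[:, :2], axis=0)
    return voxels, pillars

# Three points stacked vertically (e.g. on a pedestrian) plus one distant point.
points = np.array([
    [0.1, 0.1, 0.2],
    [0.1, 0.1, 1.0],
    [0.1, 0.1, 1.7],
    [2.3, 0.4, 0.1],
])
voxels, pillars = grid_indices(points, cell_size=0.5, origin=np.zeros(3))
print(len(voxels), "occupied voxels;", len(pillars), "occupied pillars")
```

In this toy input the three stacked points occupy three distinct voxels but a single pillar, which is the vertical detail a pillar-only encoding has to recover from per-pillar features.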


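Point cloud

Point clouds obtained by laser measurement contain three-dimensional coordinates (XYZ) and laser reflection intensity (Intensity). Point clouds obtained by photogrammetry contain three-dimensional coordinates (XYZ) and color information (RGB). Point clouds obtained by combining laser measurement and photogrammetry contain three-dimensional coordinates (XYZ), laser reflection intensity (Intensity), and color information (RGB). Once the spatial coordinates of each sampled point on an object's surface have been acquired, the result is a set of points, called a "point cloud".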
