Recent Transformer-based 3D object detectors learn point cloud features from either point- or voxel-based representations. However, the former requires time-consuming sampling, while the latter introduces quantization errors. In this paper, we present a novel Point-Voxel Transformer for single-stage 3D detection (PVT-SSD) that takes advantage of both representations. Specifically, we first use voxel-based sparse convolutions for efficient feature encoding. Then, we propose a Point-Voxel Transformer (PVT) module that obtains long-range contexts cheaply from voxels while attaining accurate positions from points. The key to associating the two representations is our input-dependent Query Initialization module, which efficiently generates reference points and content queries. PVT then adaptively fuses long-range contextual and local geometric information around the reference points into the content queries. Further, to quickly find the neighboring points of reference points, we design the Virtual Range Image module, which generalizes the native range image to multi-sensor and multi-frame settings. Experiments on several autonomous driving benchmarks verify the effectiveness and efficiency of the proposed method. Code will be available at https://github.com/Nightmare-n/PVT-SSD.
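
To illustrate why a range image makes neighbor lookup cheap, the following minimal sketch projects the points of a single sweep onto a native (single-sensor, single-frame) range image and collects neighbors from a small pixel window. It is not the Virtual Range Image module of PVT-SSD. The function names, the field-of-view parameters, and the rule of keeping only the closest point per pixel are assumptions made for illustration.

```python
import math


def build_range_image(points, height, width, fov_up_deg, fov_down_deg):
    """Project (x, y, z) points onto a height x width grid by elevation and azimuth.

    Returns (image, pixels): image[row][col] holds the index of the closest
    point that falls into that pixel (or -1), and pixels[i] is the (row, col)
    of point i (or None for a point at the sensor origin).
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down
    image = [[-1] * width for _ in range(height)]
    depth = [[math.inf] * width for _ in range(height)]
    pixels = []
    for i, (x, y, z) in enumerate(points):
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            pixels.append(None)  # direction is undefined at the origin
            continue
        azimuth = math.atan2(y, x)  # in [-pi, pi]
        elevation = math.asin(max(-1.0, min(1.0, z / r)))
        # Azimuth 0 lands in the middle column; +pi and -pi both wrap to column 0.
        col = int(0.5 * (1.0 - azimuth / math.pi) * width) % width
        # Top row corresponds to fov_up; rows outside the field of view are clamped.
        row = int((fov_up - elevation) / fov * height)
        row = min(max(row, 0), height - 1)
        pixels.append((row, col))
        if r < depth[row][col]:  # assumed rule: keep the closest point per pixel
            depth[row][col] = r
            image[row][col] = i
    return image, pixels


def neighbors_in_window(image, row, col, half_window):
    """Return point indices stored in the (2*half_window+1)^2 window around a pixel."""
    height, width = len(image), len(image[0])
    found = []
    for dr in range(-half_window, half_window + 1):
        rr = row + dr
        if rr < 0 or rr >= height:
            continue  # no wrap-around in elevation
        for dc in range(-half_window, half_window + 1):
            cc = (col + dc) % width  # azimuth wraps around
            idx = image[rr][cc]
            if idx >= 0 and idx not in found:
                found.append(idx)
    return found
```

The lookup cost depends only on the window size, not on the number of points, which is the property that makes range images attractive for neighbor queries. The window returns candidates that are close in viewing direction, so a real pipeline would typically still filter them by 3D distance.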