Developing high-performance, real-time architectures for LiDAR-based 3D object detectors is essential for the successful commercialization of autonomous vehicles. Pillar-based methods are a practical choice for onboard deployment because of their computational efficiency. However, they can underperform alternative point encoding techniques such as Voxel-encoding or PointNet++. We argue that current pillar-based methods do not sufficiently capture the fine-grained distribution of LiDAR points within each pillar, which leaves considerable room for improvement in pillar feature encoding. In this paper, we introduce a novel pillar encoding architecture, Fine-Grained Pillar Feature Encoding (FG-PFE). FG-PFE uses Spatio-Temporal Virtual (STV) grids to capture the distribution of point clouds within each pillar across the vertical, temporal, and horizontal dimensions. Through the STV grids, the points within each pillar are encoded separately by Vertical PFE (V-PFE), Temporal PFE (T-PFE), and Horizontal PFE (H-PFE). The encoded features are then aggregated by an Attentive Pillar Aggregation method. Our experiments on the nuScenes dataset demonstrate that FG-PFE achieves significant performance improvements over baseline models such as PointPillar, CenterPoint-Pillar, and PillarNet, with only a minor increase in computational overhead.
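To make the idea of virtual grids inside a pillar concrete, the following is a minimal sketch and not the authors' implementation. It assumes points stored as rows of (x, y, z, t), uses plain mean pooling as a stand-in for the learned V-PFE, T-PFE, and H-PFE encoders, and uses fixed placeholder scores where a real model would predict attention scores. The function names `cell_index`, `pool_by_cell`, and `attentive_aggregate`, the grid sizes, and the coordinate ranges are all hypothetical.

```python
import numpy as np

def cell_index(values, lo, hi, n_bins):
    """Map each value to an equal-width virtual cell index in [0, n_bins - 1]."""
    idx = np.floor((values - lo) / (hi - lo) * n_bins).astype(int)
    return np.clip(idx, 0, n_bins - 1)

def pool_by_cell(points, idx, n_cells):
    """Mean-pool point rows per cell (stand-in for a learned encoder); empty cells stay zero."""
    pooled = np.zeros((n_cells, points.shape[1]))
    for c in range(n_cells):
        mask = idx == c
        if mask.any():
            pooled[c] = points[mask].mean(axis=0)
    return pooled.reshape(-1)

def attentive_aggregate(branches, scores):
    """Softmax-weighted sum of equally sized branch features."""
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    fused = (w[:, None] * np.stack(branches)).sum(axis=0)
    return fused, w

# One pillar with four points; columns are x, y, z, t (assumed layout).
pts = np.array([
    [0.1, 0.2, -1.5, 0.00],
    [0.9, 0.8, -1.0, 0.00],
    [0.4, 0.6,  0.5, 0.25],
    [0.6, 0.1,  1.5, 0.25],
])

# Vertical grid: 4 cells along z over an assumed range [-2, 2).
v_idx = cell_index(pts[:, 2], -2.0, 2.0, 4)
# Temporal grid: 4 cells along t over an assumed range [0, 0.4).
t_idx = cell_index(pts[:, 3], 0.0, 0.4, 4)
# Horizontal grid: 2 x 2 cells over an assumed unit pillar footprint.
h_idx = cell_index(pts[:, 0], 0.0, 1.0, 2) * 2 + cell_index(pts[:, 1], 0.0, 1.0, 2)

v_feat = pool_by_cell(pts, v_idx, 4)
t_feat = pool_by_cell(pts, t_idx, 4)
h_feat = pool_by_cell(pts, h_idx, 4)

# Placeholder scores; a trained model would predict these per pillar.
scores = np.array([0.0, 0.0, 0.0])
fused, weights = attentive_aggregate([v_feat, t_feat, h_feat], scores)
```

Each branch here produces a 16-dimensional vector (4 cells times 4 point channels), so the three can be summed directly. With learned encoders, each branch would instead be projected to a common channel width before aggregation.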