We propose a late-to-early recurrent feature fusion scheme for 3D object detection using temporal LiDAR point clouds. Our main motivation is to fuse object-aware latent embeddings into the early stages of a 3D object detector. This feature fusion strategy enables the model to capture the shapes and poses of challenging objects better than learning from raw points directly. Our method conducts late-to-early feature fusion in a recurrent manner, by applying window-based attention blocks to temporally calibrated and aligned sparse pillar tokens. Leveraging bird's eye view foreground pillar segmentation, we reduce the number of sparse history features that our model needs to fuse into its current frame by 10$\times$. We also propose a stochastic-length FrameDrop training technique, which lets the model generalize to variable-length frame sequences at inference, improving performance without retraining. We evaluate our method on the widely adopted Waymo Open Dataset and demonstrate improved 3D object detection over the baseline model, especially for the challenging category of large objects.
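To make the stochastic-length training idea concrete, the following is a minimal sketch of one plausible reading of FrameDrop: for each training sample, a random number of history frames is kept, so the model sees sequences of varying length. The abstract does not specify the sampling procedure, so the function name `sample_history_frames`, the parameter `max_keep`, and the uniform sampling rule are all assumptions made for illustration, not the authors' implementation.

```python
import random


def sample_history_frames(num_history, max_keep, rng):
    """Return a sorted, random-length subset of history frame indices.

    Hypothetical helper: the uniform choice of length and of frames
    is an assumption, not the procedure used in the paper.
    """
    # Draw how many history frames survive for this training sample.
    keep = rng.randint(0, min(max_keep, num_history))
    # Choose which frames survive; sorting preserves temporal order.
    return sorted(rng.sample(range(num_history), keep))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for step in range(3):
        print(step, sample_history_frames(num_history=8, max_keep=4, rng=rng))
```

Under this assumption, a recurrent fusion model trained on such subsets is exposed to anywhere from zero to `max_keep` history frames, which is one way a single set of weights could be run with different sequence lengths at inference.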