Augmenting LiDAR input with multiple previous frames provides richer semantic information and thus boosts performance in 3D object detection. However, the crowded point clouds that result from stacking multiple frames can degrade precise position information because of motion blur and inaccurate point projection. In this work, we propose a novel feature fusion strategy, DynStaF (Dynamic-Static Fusion), which enhances the rich semantic information provided by the multi-frame input (dynamic branch) with the accurate location information from the current single frame (static branch). To effectively extract and aggregate complementary features, DynStaF contains two modules, Neighborhood Cross Attention (NCA) and Dynamic-Static Interaction (DSI), operating through a dual-pathway architecture. NCA takes the features in the static branch as queries and the features in the dynamic branch as keys and values. When computing the attention, we account for the sparsity of point clouds and consider only neighborhood positions. NCA fuses the two features at different feature map scales, followed by DSI, which provides comprehensive interaction between the branches. To analyze DynStaF, we conduct extensive experiments on the nuScenes dataset. On the test set, DynStaF increases the NDS of PointPillars by a large margin, from 57.7% to 61.6%. When combined with CenterPoint, our framework achieves 61.0% mAP and 67.7% NDS, leading to state-of-the-art performance without bells and whistles.
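The NCA description above (static features as queries, dynamic features as keys and values, attention restricted to a local neighborhood) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function name `neighborhood_cross_attention`, the `kernel_size` parameter, the clipping of the window at map borders, and the omission of learned query/key/value projections and multiple heads are all assumptions made for illustration.

```python
import math


def neighborhood_cross_attention(static_feat, dynamic_feat, kernel_size=3):
    """Fuse two H x W x C feature maps given as nested lists of floats.

    Hypothetical sketch: queries come from static_feat, keys and values
    from dynamic_feat. Each query attends only to the kernel_size x
    kernel_size window centred on its own position in dynamic_feat.
    Learned projections are omitted (assumption, for brevity).
    """
    height, width = len(static_feat), len(static_feat[0])
    channels = len(static_feat[0][0])
    radius = kernel_size // 2
    scale = 1.0 / math.sqrt(channels)

    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            query = static_feat[y][x]
            # Collect in-bounds neighbours; the window is clipped at borders.
            neighbours = [
                dynamic_feat[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            # Scaled dot-product score between the query and each neighbour key.
            scores = [
                scale * sum(q * k for q, k in zip(query, key))
                for key in neighbours
            ]
            # Numerically stable softmax over the neighbourhood only.
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            # Attention-weighted sum of the neighbour values.
            row.append([
                sum(e * value[c] for e, value in zip(exps, neighbours)) / total
                for c in range(channels)
            ])
        fused.append(row)
    return fused
```

Because each query only scores the positions inside its window, the cost grows with `kernel_size` squared per position rather than with the full map size, which is the practical motivation for restricting attention to neighborhoods on sparse bird's-eye-view feature maps. With `kernel_size=1` the sketch reduces to copying the dynamic feature at the same position.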