Recent advances in 4D imaging radar have enabled robust perception in adverse weather, while camera sensors provide dense semantic information. Fusing these complementary modalities holds great potential for cost-effective 3D perception. However, most existing camera-radar fusion methods are limited to single-frame inputs, capturing only a partial view of the scene. This incomplete scene information, compounded by image degradation and 4D radar sparsity, hinders overall detection performance. In contrast, multi-frame fusion offers richer spatiotemporal information but faces two challenges: achieving robust and effective object feature fusion across frames and modalities, and mitigating the computational cost of redundant feature extraction. Consequently, we propose M^3Detection, a unified multi-frame 3D object detection framework that performs multi-level feature fusion on multi-modal data from cameras and 4D imaging radar. Our framework leverages intermediate features from the baseline detector and employs a tracker to produce reference trajectories, improving computational efficiency and providing richer information for the second stage. In the second stage, we design a global-level inter-object feature aggregation module, guided by radar information, that aligns global features across candidate proposals, and a local-level inter-grid feature aggregation module that expands local features along the reference trajectories to enhance fine-grained object representation. The aggregated features are then processed by a trajectory-level multi-frame spatiotemporal reasoning module to encode cross-frame interactions and enhance temporal representation. Extensive experiments on the VoD and TJ4DRadSet datasets demonstrate that M^3Detection achieves state-of-the-art 3D detection performance, validating its effectiveness in multi-frame detection with camera-4D imaging radar fusion.