3D object detection and occupancy prediction are critical tasks in autonomous driving and have attracted significant attention. Although recent vision-based methods show strong potential, they struggle under adverse conditions. Integrating cameras with next-generation 4D imaging radar to achieve unified multi-task perception is therefore highly valuable, yet research in this area remains limited. In this paper, we propose Doracamom, the first framework that fuses multi-view cameras and 4D radar for joint 3D object detection and semantic occupancy prediction, enabling comprehensive environmental perception. Specifically, we introduce a novel Coarse Voxel Queries Generator that integrates geometric priors from 4D radar with semantic features from images to initialize voxel queries, establishing a robust foundation for subsequent Transformer-based refinement. To exploit temporal information, we design a Dual-Branch Temporal Encoder that processes multi-modal temporal features in parallel across BEV and voxel spaces, enabling comprehensive spatio-temporal representation learning. Furthermore, we propose a Cross-Modal BEV-Voxel Fusion module that adaptively fuses complementary features through attention mechanisms while employing auxiliary tasks to enhance feature quality. Extensive experiments on the OmniHD-Scenes, View-of-Delft (VoD), and TJ4DRadSet datasets demonstrate that Doracamom achieves state-of-the-art performance on both tasks, establishing new benchmarks for multi-modal 3D perception. Code and models will be made publicly available.
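The abstract only names the Coarse Voxel Queries Generator at a high level. As a rough illustration of the idea, and not the paper's actual design, the following PyTorch sketch shows one way radar-derived geometric priors and image semantic features sampled into a shared voxel grid could be fused into initial voxel queries; all class names, feature shapes, and the gated-fusion choice here are assumptions.

```python
# Hypothetical sketch of coarse voxel query initialization (names, shapes, and the
# gating scheme are assumptions, not the paper's implementation). Radar features
# voxelized into the grid act as a geometric prior; image semantic features lifted
# into the same grid supply appearance cues; a learned gate blends them per voxel.
import torch
import torch.nn as nn

class CoarseVoxelQueryGenerator(nn.Module):
    def __init__(self, img_channels: int, radar_channels: int, embed_dim: int):
        super().__init__()
        # Project per-voxel radar features (e.g. occupancy, Doppler, RCS) to the query dim.
        self.radar_proj = nn.Linear(radar_channels, embed_dim)
        # Project per-voxel image semantic features (already sampled into the grid) likewise.
        self.img_proj = nn.Linear(img_channels, embed_dim)
        # Simple gated fusion of the two modalities.
        self.gate = nn.Sequential(nn.Linear(2 * embed_dim, embed_dim), nn.Sigmoid())

    def forward(self, radar_voxel_feats: torch.Tensor, img_voxel_feats: torch.Tensor):
        # radar_voxel_feats: (B, N_vox, C_radar); img_voxel_feats: (B, N_vox, C_img)
        r = self.radar_proj(radar_voxel_feats)
        i = self.img_proj(img_voxel_feats)
        g = self.gate(torch.cat([r, i], dim=-1))
        # Coarse voxel queries: geometry-weighted blend of the two modalities,
        # to be refined by a subsequent Transformer decoder.
        return g * r + (1.0 - g) * i

# Toy usage: a 100x100x8 grid flattened to N_vox voxel queries of dimension 128.
B, N_vox = 1, 100 * 100 * 8
gen = CoarseVoxelQueryGenerator(img_channels=64, radar_channels=8, embed_dim=128)
queries = gen(torch.randn(B, N_vox, 8), torch.randn(B, N_vox, 64))
print(queries.shape)  # torch.Size([1, 80000, 128])
```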