Embodied perception is essential for intelligent vehicles and robots, enabling more natural interaction and task execution. However, current advances remain largely at the vision level and rarely exploit 3D modeling sensors, which limits full understanding of surrounding objects with multi-granular characteristics. Recently, as a promising and affordable automotive sensor, 4D millimeter-wave radar provides denser point clouds than conventional radar and perceives both semantic and physical characteristics of objects, thereby enhancing the reliability of the perception system. To foster the development of natural language-driven context understanding in radar scenes for 3D grounding, we construct the first dataset, Talk2Radar, which bridges the language and radar modalities for 3D Referring Expression Comprehension (REC). Talk2Radar contains 8,682 referring prompt samples with 20,558 referred objects. Moreover, we propose a novel model, T-RadarNet, for 3D REC on point clouds, which achieves state-of-the-art performance on the Talk2Radar dataset compared with counterparts. In T-RadarNet, Deformable-FPN and Gated Graph Fusion are designed for efficient point cloud feature modeling and for cross-modal fusion between radar and text features, respectively. Further, comprehensive experiments are conducted to give deeper insight into radar-based 3D REC. We release our project at https://github.com/GuanRunwei/Talk2Radar.