In automotive sensor fusion systems, smart sensors and Vehicle-to-Everything (V2X) modules are widely used. Unlike traditional sensors, these components typically expose their measurements only as processed object lists rather than raw data. Instead of processing the remaining raw data separately and fusing the results at the object level, we propose an end-to-end cross-level fusion concept based on a Transformer, which integrates highly abstract object-list information with raw camera images for 3D object detection. Object lists are fed into the Transformer as denoising queries and propagated, together with learnable queries, through the subsequent feature aggregation process. Additionally, a deformable Gaussian mask, derived from the positional and size priors in the object lists, is explicitly integrated into the Transformer decoder; it directs attention toward the target areas of interest and accelerates training convergence. Furthermore, since no public dataset contains object lists as a standalone modality, we propose an approach to generate pseudo object lists from ground-truth bounding boxes by simulating state noise as well as false positives and false negatives. As the first work on cross-level fusion, our approach shows substantial performance improvements over the vision-based baseline on the nuScenes dataset and generalizes across diverse noise levels of simulated object lists as well as real detectors.
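To make the Gaussian-mask idea concrete, the following is a minimal sketch (not the paper's implementation) of how per-object 2D Gaussian masks could be built from object-list position and size priors and added, in log space, to attention logits so that attention is biased toward each object's region. All names and the feature-map layout are illustrative assumptions.

```python
import numpy as np

def gaussian_attention_mask(centers, sizes, H, W):
    """Illustrative sketch: one 2D Gaussian mask per object on an H x W
    feature map. `centers` are (cx, cy) pixel coordinates and `sizes`
    are (sx, sy) standard deviations derived from box size priors.
    Returned values are log-Gaussians (<= ~0), meant to be added to
    attention scores before the softmax."""
    ys, xs = np.mgrid[0:H, 0:W]  # pixel coordinate grids
    masks = []
    for (cx, cy), (sx, sy) in zip(centers, sizes):
        # unnormalized 2D Gaussian centered on the object prior
        g = np.exp(-(((xs - cx) ** 2) / (2 * sx ** 2)
                     + ((ys - cy) ** 2) / (2 * sy ** 2)))
        # log space so the mask acts additively on attention logits
        masks.append(np.log(g + 1e-6))
    return np.stack(masks)  # shape (num_objects, H, W)
```

In a real decoder the mask offsets would additionally be learnable ("deformable"), letting the network shift and reshape each Gaussian rather than trusting the noisy prior exactly.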
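The pseudo-object-list generation described above can be sketched as follows. This is a hedged illustration, not the paper's exact simulation: boxes are assumed to be rows of (x, y, z, w, l, h), noise magnitudes and drop/injection rates are illustrative parameters, and the scene bounds for false positives are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_pseudo_object_list(gt_boxes, pos_sigma=0.3, size_sigma=0.1,
                            fn_rate=0.1, fp_rate=0.05):
    """Sketch: derive a noisy object list from ground-truth boxes by
    (1) adding Gaussian state noise to position and size,
    (2) randomly dropping boxes (false negatives), and
    (3) injecting random spurious boxes (false positives)."""
    boxes = np.asarray(gt_boxes, dtype=float)
    noisy = boxes.copy()
    # (1) state noise: additive on position, multiplicative on size
    noisy[:, :3] += rng.normal(0.0, pos_sigma, size=(len(boxes), 3))
    noisy[:, 3:6] *= 1.0 + rng.normal(0.0, size_sigma, size=(len(boxes), 3))
    # (2) false negatives: drop each box with probability fn_rate
    keep = rng.random(len(noisy)) > fn_rate
    noisy = noisy[keep]
    # (3) false positives: sample spurious boxes in an assumed scene range
    n_fp = rng.binomial(len(boxes), fp_rate)
    fp = rng.uniform([-50.0, -50.0, -2.0, 0.5, 0.5, 0.5],
                     [50.0, 50.0, 2.0, 3.0, 6.0, 3.0], size=(n_fp, 6))
    return np.concatenate([noisy, fp], axis=0)
```

Sweeping `pos_sigma`, `fn_rate`, and `fp_rate` yields the "diverse noise levels" of simulated object lists over which generalization can be evaluated.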