Combining LiDAR and camera data has shown promise for improving short-range object detection in autonomous driving systems. Long-range detection remains difficult for fusion methods, however, because LiDAR data is sparse while camera resolution is dense. Discrepancies between the two data representations further complicate fusion. We introduce AYDIV, a novel framework that integrates a tri-phase alignment process specifically designed to enhance long-distance detection even amidst data discrepancies. AYDIV consists of three components: the Global Contextual Fusion Alignment Transformer (GCFAT), which improves the extraction of camera features and provides a deeper understanding of large-scale patterns; the Sparse Fused Feature Attention (SFFA), which fine-tunes the fusion of LiDAR and camera details; and the Volumetric Grid Attention (VGA), which performs comprehensive spatial data fusion. AYDIV improves mAPH (L2 difficulty) by 1.24% on the Waymo Open Dataset (WOD) and AP by 7.40% on the Argoverse2 Dataset, demonstrating its efficacy relative to existing fusion-based methods. Our code is publicly available at https://github.com/sanjay-810/AYDIV2
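To make the attention-based fusion idea concrete, the following is a minimal sketch of generic cross-attention in which sparse LiDAR features act as queries over dense camera features. It is not the GCFAT, SFFA, or VGA module described above, and it is not taken from the linked repository; the function names (`softmax`, `dot`, `cross_attend`), the toy feature values, and the simplifications (single head, no learned projections, keys reused as values, residual sum) are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attend(lidar_feats, camera_feats):
    """Let each LiDAR feature (query) attend over all camera features.

    Hypothetical simplification: single head, no learned projections,
    and the camera features serve as both keys and values.
    """
    dim = len(lidar_feats[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in lidar_feats:
        # Scaled dot-product scores of this query against every camera feature.
        weights = softmax([dot(query, key) * scale for key in camera_feats])
        # Attention-weighted average of the camera features.
        context = [
            sum(w * value[d] for w, value in zip(weights, camera_feats))
            for d in range(dim)
        ]
        # Residual sum keeps the LiDAR signal alongside the camera context.
        fused.append([q + c for q, c in zip(query, context)])
    return fused


# Toy data (assumed): 2 sparse LiDAR features, 4 dense camera features, 3 channels.
lidar = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
camera = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]
fused = cross_attend(lidar, camera)
```

The output has one fused vector per LiDAR feature, so the sparse side sets the output size while the dense side only contributes context. A practical implementation would likely add learned query, key, and value projections and multiple heads.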