Multimodal camera-LiDAR fusion is widely used in 3D object detection and has shown encouraging performance. However, existing methods degrade significantly in challenging scenarios involving sensor degradation or environmental disturbances. We propose a novel Adaptive Gated Fusion (AG-Fusion) approach that selectively integrates cross-modal knowledge by identifying reliable patterns, enabling robust detection in complex scenes. Specifically, we first project the features of each modality into a unified bird's-eye-view (BEV) space and enhance them with a window-based attention mechanism. We then design an adaptive gated fusion module, based on cross-modal attention, that integrates these features into reliable BEV representations that remain robust in challenging environments. Furthermore, we construct a new dataset, Excavator3D (E3D), which focuses on challenging excavator operation scenarios, to benchmark performance under complex conditions. Our method not only achieves competitive performance on the standard KITTI dataset with 93.92% accuracy, but also outperforms the baseline by 24.88% on the challenging E3D dataset, demonstrating superior robustness to unreliable modal information in complex industrial scenes.
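To make the idea of gating on cross-modal attention concrete, the following is a minimal sketch and not the AG-Fusion module itself: the abstract does not specify the gate's form, so the sigmoid gate, the convex combination of the two modalities, and the names `cross_attention`, `gated_fusion`, `gate_weights`, and `gate_bias` are all illustrative assumptions. BEV grids are flattened into lists of per-cell feature vectors, and plain Python is used only to keep the toy example self-contained.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, context):
    """For each query cell, return a softmax-weighted average of context cells."""
    scale = math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in context])
        attended.append([sum(w * k[j] for w, k in zip(weights, context))
                         for j in range(len(q))])
    return attended

def gated_fusion(cam, lidar, gate_weights, gate_bias):
    """Fuse per-cell camera and LiDAR BEV features with a sigmoid gate.

    cam, lidar: lists of N cells, each a list of C floats (flattened BEV grid).
    gate_weights (C rows of 2*C floats) and gate_bias (C floats) are
    hypothetical parameters that would be learned in a real model.
    """
    cam_ctx = cross_attention(cam, lidar)    # camera cells query LiDAR cells
    lidar_ctx = cross_attention(lidar, cam)  # LiDAR cells query camera cells
    fused, gates = [], []
    for c, l, cc, lc in zip(cam, lidar, cam_ctx, lidar_ctx):
        joint = cc + lc  # concatenated cross-modal context, length 2*C
        g = [sigmoid(dot(row, joint) + b)
             for row, b in zip(gate_weights, gate_bias)]
        # Gate near 1 trusts the camera channel, near 0 trusts LiDAR.
        fused.append([gi * ci + (1.0 - gi) * li
                      for gi, ci, li in zip(g, c, l)])
        gates.append(g)
    return fused, gates

# Toy input: two BEV cells with two channels per modality.
cam = [[1.0, 0.0], [0.5, 0.5]]
lidar = [[0.0, 1.0], [0.2, 0.8]]
zero_weights = [[0.0] * 4 for _ in range(2)]
fused, gates = gated_fusion(cam, lidar, zero_weights, [0.0, 0.0])
```

With all-zero gate parameters every gate value is 0.5, so the fused features are the plain average of the two modalities. A trained gate would instead move toward 0 or 1 per cell and channel, which is one way a fusion module could down-weight a modality that is unreliable in a given region.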