Infrastructure-based perception plays a crucial role in intelligent transportation systems, offering global situational awareness and enabling cooperative autonomy. However, existing camera-based detection models often underperform in such scenarios due to challenges such as multi-view infrastructure setups, heterogeneous camera configurations, degraded visual inputs, and varied road layouts. We introduce MIC-BEV, a Transformer-based bird's-eye-view (BEV) perception framework for infrastructure-based multi-camera 3D object detection. MIC-BEV flexibly supports a variable number of cameras with heterogeneous intrinsic and extrinsic parameters and demonstrates strong robustness under sensor degradation. The proposed graph-enhanced fusion module in MIC-BEV integrates multi-view image features into the BEV space by exploiting geometric relationships between cameras and BEV cells alongside latent visual cues. To support training and evaluation, we introduce M2I, a synthetic dataset for infrastructure-based object detection featuring diverse camera configurations, road layouts, and environmental conditions. Extensive experiments on both M2I and the real-world dataset RoScenes demonstrate that MIC-BEV achieves state-of-the-art performance in 3D object detection. It also remains robust under challenging conditions, including extreme weather and sensor degradation. These results highlight the potential of MIC-BEV for real-world deployment. The dataset and source code are available at: https://github.com/HandsomeYun/MIC-BEV.
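The camera-to-BEV geometric relationship mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation: it only shows how ground-plane BEV cell centers can be projected into a camera image under an assumed pinhole model with intrinsics `K` and a world-to-camera extrinsic matrix `T`, yielding the kind of camera-cell visibility links that a fusion module could exploit.

```python
import numpy as np

def project_bev_to_image(bev_centers, K, T_cam_from_world):
    """Project BEV cell centers (world frame, on the ground plane z=0)
    into a camera image.

    bev_centers: (N, 3) world-frame points.
    K: (3, 3) pinhole intrinsics.
    T_cam_from_world: (4, 4) extrinsics mapping world -> camera frame.
    Returns (N, 2) pixel coordinates and an (N,) in-front-of-camera mask.
    """
    n = bev_centers.shape[0]
    pts_h = np.hstack([bev_centers, np.ones((n, 1))])      # homogeneous coords
    cam_pts = (T_cam_from_world @ pts_h.T).T[:, :3]        # points in camera frame
    in_front = cam_pts[:, 2] > 1e-6                        # positive depth only
    uv = (K @ cam_pts.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)       # perspective divide
    return uv, in_front

# Hypothetical example: a camera 5 m above the origin looking straight down.
R = np.diag([1.0, -1.0, -1.0])                 # world z-up -> camera z-forward
t = -R @ np.array([0.0, 0.0, 5.0])             # camera center at (0, 0, 5)
T = np.eye(4)
T[:3, :3], T[:3, 3] = R, t
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 64.0],
              [0.0, 0.0, 1.0]])

uv, visible = project_bev_to_image(np.array([[1.0, 2.0, 0.0]]), K, T)
# A BEV cell is a candidate for feature sampling only if `visible` holds
# and `uv` falls inside the image bounds.
```

In a multi-camera setup, running this projection per camera yields a bipartite camera-to-cell visibility structure; a graph-based fusion step can then weight each camera's image features per BEV cell, which is the kind of geometric prior the abstract's fusion module refers to.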