In autonomous driving, cooperative perception uses multi-view cameras on both vehicles and infrastructure, providing a global vantage point with rich semantic context of road conditions beyond what a single vehicle can see. Two major challenges persist in vehicle-infrastructure cooperative 3D (VIC3D) object detection: 1) inherent pose errors when fusing multi-view images, caused by time asynchrony across cameras; and 2) information loss during transmission, caused by limited communication bandwidth. To address these issues, we propose Enhanced Multi-scale Image Feature Fusion (EMIFF), a novel camera-based 3D detection framework for the VIC3D task. To fully exploit the holistic perspectives of both vehicles and infrastructure, we propose Multi-scale Cross Attention (MCA) and Camera-aware Channel Masking (CCM) modules, which enhance infrastructure and vehicle features at the scale, spatial, and channel levels to correct the pose error introduced by camera asynchrony. We also introduce a Feature Compression (FC) module with channel and spatial compression blocks to improve transmission efficiency. Experiments show that EMIFF achieves state-of-the-art (SOTA) results on the DAIR-V2X-C dataset, significantly outperforming previous early-fusion and late-fusion methods at comparable transmission costs.
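To make the bandwidth argument concrete, the following is a minimal back-of-the-envelope sketch of how channel and spatial compression together shrink the feature payload sent from infrastructure to vehicle. It is not the FC module itself: the feature-map shape, the compression ratios, the 4-byte value size, and the function names are all hypothetical values chosen for illustration, not figures from the paper.

```python
def feature_payload_bytes(channels, height, width, bytes_per_value=4):
    """Size in bytes of a dense C x H x W feature map (float32 assumed)."""
    return channels * height * width * bytes_per_value


def compressed_payload_bytes(channels, height, width,
                             channel_ratio, spatial_stride,
                             bytes_per_value=4):
    """Payload after channel and spatial compression.

    Assumption: channel compression keeps channels // channel_ratio channels,
    and spatial compression downsamples each side by spatial_stride
    (rounding up), as a strided convolution or pooling layer might.
    """
    kept_channels = max(1, channels // channel_ratio)
    out_height = -(-height // spatial_stride)  # ceiling division
    out_width = -(-width // spatial_stride)
    return feature_payload_bytes(kept_channels, out_height, out_width,
                                 bytes_per_value)


# Hypothetical infrastructure feature map: 256 channels, 96 x 160 cells.
raw = feature_payload_bytes(256, 96, 160)
sent = compressed_payload_bytes(256, 96, 160,
                                channel_ratio=8, spatial_stride=2)
print(f"raw: {raw} bytes, sent: {sent} bytes, reduction: {raw // sent}x")
```

Under these assumed settings the reduction factor is the channel ratio times the square of the spatial stride, which is why combining the two kinds of compression can cut transmission cost far more than either alone.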