Vehicle-to-vehicle (V2V) technologies enable autonomous vehicles to share information and see through occlusions, greatly enhancing perception performance. However, existing work has focused on homogeneous traffic, in which all vehicles are equipped with the same type of sensors. This assumption significantly limits the scale of collaboration and the benefit of cross-modality interactions. In this paper, we investigate the multi-agent hetero-modal cooperative perception problem, where agents may have distinct sensor modalities. We present HM-ViT, the first unified multi-agent hetero-modal cooperative perception framework, which can collaboratively predict 3D objects in highly dynamic V2V collaborations with varying numbers and types of agents. To effectively fuse features from multi-view images and LiDAR point clouds, we design a novel heterogeneous 3D graph transformer that jointly reasons about inter-agent and intra-agent interactions. Extensive experiments on the V2V perception dataset OPV2V demonstrate that HM-ViT outperforms state-of-the-art cooperative perception methods on V2V hetero-modal cooperative perception. We will release code to facilitate future research.