Cooperative perception overcomes the perception limitations of single-agent systems by leveraging Vehicle-to-Everything (V2X) communication for data sharing and fusion across multiple agents. However, most existing approaches exchange only single-modality data, limiting the potential of both homogeneous and heterogeneous fusion across agents and overlooking the multi-modality data available to each agent, which restricts overall system performance. In the automotive industry, manufacturers adopt diverse sensor configurations, resulting in heterogeneous combinations of sensor modalities across agents. To exploit every available data source for optimal performance, we design a robust LiDAR-camera cross-modality fusion module, Radian-Glue-Attention (RG-Attn), applicable to both intra-agent and inter-agent cross-modality fusion, owing to convenient coordinate conversion via transformation matrices and a unified sampling/inversion mechanism. We also propose two architectures for cooperative perception, Paint-To-Puzzle (PTP) and Co-Sketching-Co-Coloring (CoS-CoCo). PTP targets maximum precision and achieves a smaller data packet size by limiting cross-agent fusion to a single instance, but it requires all participants to be equipped with LiDAR. In contrast, CoS-CoCo supports agents with any configuration (LiDAR-only, camera-only, or both), offering greater generalization. Our approach achieves state-of-the-art (SOTA) performance on both real and simulated cooperative perception datasets. The code is available on GitHub.
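To make the fusion idea concrete, below is a minimal, hypothetical PyTorch sketch of the general pattern the abstract describes: convert coordinates with a transformation matrix, sample camera features at the projected locations, and fuse them into the LiDAR BEV features with cross-attention. All names, shapes, and the attention layout here (`CrossModalityFusion`, `lidar2img`, single camera, no out-of-view masking) are illustrative assumptions, not the authors' actual RG-Attn module.

```python
# Illustrative sketch only: generic coordinate-conversion + sampling +
# cross-attention fusion, NOT the paper's RG-Attn implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalityFusion(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        # LiDAR BEV features act as queries over sampled camera features.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, bev_feat, cam_feat, bev_xyz, lidar2img):
        """
        bev_feat:  (B, N, C)    LiDAR BEV features, one vector per BEV cell
        cam_feat:  (B, C, H, W) camera feature map
        bev_xyz:   (B, N, 3)    3-D centers of the BEV cells (LiDAR frame)
        lidar2img: (B, 4, 4)    assumed projection: LiDAR frame -> pixels
        """
        B, N, _ = bev_xyz.shape
        # Homogeneous coordinates, then project into the image plane.
        ones = torch.ones(B, N, 1, device=bev_xyz.device)
        pts = torch.cat([bev_xyz, ones], dim=-1)              # (B, N, 4)
        uvd = torch.einsum('bij,bnj->bni', lidar2img, pts)    # (B, N, 4)
        uv = uvd[..., :2] / uvd[..., 2:3].clamp(min=1e-5)     # pixel coords

        # Normalize to [-1, 1] and bilinearly sample camera features.
        H, W = cam_feat.shape[-2:]
        grid = torch.stack([uv[..., 0] / (W - 1),
                            uv[..., 1] / (H - 1)], dim=-1) * 2 - 1
        sampled = F.grid_sample(cam_feat, grid.unsqueeze(1),
                                align_corners=True)           # (B, C, 1, N)
        sampled = sampled.squeeze(2).transpose(1, 2)          # (B, N, C)

        # Cross-attention: each BEV cell queries its sampled camera feature,
        # then a residual connection keeps the original LiDAR information.
        fused, _ = self.attn(bev_feat, sampled, sampled)
        return self.norm(bev_feat + fused)
```

Because the projection is just a matrix product, the same sampling path works whether `lidar2img` maps an agent's own LiDAR frame to its own camera (intra-agent fusion) or another agent's frame to a remote camera (inter-agent fusion), which is the flexibility the abstract attributes to the unified coordinate conversion and sampling mechanism.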