Cooperative perception overcomes the perception limitations of single-agent systems by leveraging Vehicle-to-Everything (V2X) communication to share and fuse data across multiple agents. However, most existing approaches exchange only single-modality data, limiting both homogeneous and heterogeneous fusion across agents and overlooking each agent's multi-modality data, which restricts overall system performance. In the automotive industry, manufacturers adopt diverse sensor configurations, resulting in heterogeneous combinations of sensor modalities across agents. To harness every available data source for optimal performance, we design a robust LiDAR-camera cross-modality fusion module, Radian-Glue-Attention (RG-Attn), applicable to both intra-agent and inter-agent cross-modality fusion, owing to its convenient coordinate conversion via transformation matrices and its unified sampling/inversion mechanism. We also propose two cooperative perception architectures, Paint-To-Puzzle (PTP) and Co-Sketching-Co-Coloring (CoS-CoCo). PTP targets maximum precision and achieves a smaller data packet size by limiting cross-agent fusion to a single instance, but requires all participants to be equipped with LiDAR. In contrast, CoS-CoCo supports agents with any configuration (LiDAR-only, camera-only, or both LiDAR and camera), offering greater generalization. Our approach achieves state-of-the-art (SOTA) performance on both real and simulated cooperative perception datasets. The code will be released on GitHub in early 2025.