By sharing complementary perceptual information, multi-agent collaborative perception fosters a deeper understanding of the environment. Recent studies on collaborative perception mostly utilize CNNs or Transformers to learn feature representation and fusion in the spatial dimension, which struggle to handle long-range spatial-temporal features under limited computing and communication resources. Holistically modeling the dependencies over extensive spatial areas and extended temporal frames is crucial to enhancing feature quality. To this end, we propose a resource-efficient cross-agent spatial-temporal collaborative state space model (SSM), named CollaMamba. First, we construct a foundational backbone network based on a spatial SSM. This backbone adeptly captures positional causal dependencies from both single-agent and cross-agent views, yielding compact and comprehensive intermediate features while maintaining linear complexity. Furthermore, we devise a history-aware feature boosting module based on a temporal SSM, which extracts contextual cues from extended historical frames to refine vague features while preserving low overhead. Extensive experiments across several datasets demonstrate that CollaMamba outperforms state-of-the-art methods, achieving higher model accuracy while reducing computational overhead by up to 71.9% and communication overhead to as little as 1/64. This work pioneers the exploration of Mamba's potential in collaborative perception. The source code will be made available.
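To make the core mechanism concrete, the sketch below shows a minimal Mamba-style selective state space block of the kind such a backbone could build on: input-dependent dynamics scanned over a flattened feature sequence in linear time. This is an illustrative assumption, not the authors' released implementation; the class name SelectiveSSMBlock and all parameter names are hypothetical.

```python
# Minimal sketch of a selective SSM (Mamba-style) block, assuming PyTorch.
# All names and shapes are illustrative; this is not CollaMamba's actual code.
import torch
import torch.nn as nn


class SelectiveSSMBlock(nn.Module):
    """Scans a feature sequence with input-dependent state dynamics,
    giving linear complexity in sequence length."""

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.d_state = d_state
        # Input-dependent (selective) projections, as in Mamba.
        self.to_delta = nn.Linear(d_model, d_model)
        self.to_B = nn.Linear(d_model, d_state)
        self.to_C = nn.Linear(d_model, d_state)
        # Learned state matrix A, log-parameterized for stability.
        self.A_log = nn.Parameter(torch.zeros(d_model, d_state))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), e.g. a BEV feature map flattened
        # along a causal scan order.
        bsz, seq_len, d_model = x.shape
        A = -torch.exp(self.A_log)                                # (D, N)
        delta = nn.functional.softplus(self.to_delta(x))          # (B, L, D)
        B_seq = self.to_B(x)                                      # (B, L, N)
        C_seq = self.to_C(x)                                      # (B, L, N)
        h = x.new_zeros(bsz, d_model, self.d_state)               # recurrent state
        outputs = []
        for t in range(seq_len):                                  # linear-time scan
            dA = torch.exp(delta[:, t].unsqueeze(-1) * A)         # (B, D, N)
            dB = delta[:, t].unsqueeze(-1) * B_seq[:, t].unsqueeze(1)
            h = dA * h + dB * x[:, t].unsqueeze(-1)               # state update
            outputs.append((h * C_seq[:, t].unsqueeze(1)).sum(-1))  # readout (B, D)
        return torch.stack(outputs, dim=1)                        # (B, L, D)
```

Under the abstract's description, one plausible use of such a block is to flatten each agent's spatial feature map into a sequence for the spatial scan, concatenate ego and collaborator sequences for the cross-agent view, and run an analogous scan over stacked historical frames for the temporal module; the exact scan orders and fusion scheme are details of the paper, not of this sketch.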