Recent advances in multi-modal detection have significantly improved accuracy in challenging environments (e.g., low light, overexposure). By integrating RGB with modalities such as thermal and depth, multi-modal fusion increases data redundancy and system robustness. However, significant challenges remain in effectively extracting task-relevant information both within and across modalities, and in achieving precise cross-modal alignment. While CNNs excel at feature extraction, they are limited by constrained receptive fields, strong inductive biases, and difficulty in capturing long-range dependencies. Transformer-based models offer global context but suffer from quadratic computational complexity and are confined to pairwise correlation modeling. Mamba and other State Space Models (SSMs), in turn, are hindered by their sequential scanning mechanism, which flattens 2D spatial structures into 1D sequences, disrupting topological relationships and limiting the modeling of complex higher-order dependencies. To address these issues, we propose M2I2HA, a multi-modal perception network based on hypergraph theory. Our architecture includes an Intra-Hypergraph Enhancement module to capture global many-to-many high-order relationships within each modality, and an Inter-Hypergraph Fusion module to align, enhance, and fuse cross-modal features by bridging configuration and spatial gaps between data sources. We further introduce an M2-FullPAD module that enables adaptive multi-level fusion of the multi-modal enhanced features while improving feature distribution and flow across the architecture. Extensive experiments on multiple public datasets demonstrate that M2I2HA outperforms strong baselines and achieves state-of-the-art performance in multi-modal object detection.
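To make the "many-to-many high-order relationship" claim concrete: unlike a graph edge, a hyperedge can connect an arbitrary set of nodes (e.g., feature tokens), so one round of message passing mixes information among all members of each hyperedge at once. The abstract does not specify the exact update rule of the Intra-Hypergraph Enhancement module, so the sketch below uses the standard HGNN-style spectral hypergraph convolution, X' = Dv^{-1/2} H De^{-1} H^T Dv^{-1/2} X Θ, with unit hyperedge weights; the function name and toy incidence matrix are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def hypergraph_conv(X, H, Theta):
    """One generic hypergraph convolution step (HGNN-style, an assumed stand-in
    for the paper's module):
        X' = Dv^{-1/2} H De^{-1} H^T Dv^{-1/2} X Theta
    X:     (n_nodes, d_in)  node (token) features
    H:     (n_nodes, n_edges) incidence matrix; H[v, e] = 1 if node v is in hyperedge e
    Theta: (d_in, d_out) learnable projection (identity here for clarity)
    """
    Dv = H.sum(axis=1)                                  # node degrees
    De = H.sum(axis=0)                                  # hyperedge degrees
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(Dv, 1e-12)))
    De_inv = np.diag(1.0 / np.maximum(De, 1e-12))
    # Normalized propagation operator: every node in a hyperedge
    # exchanges information with every other member in one step.
    A = Dv_inv_sqrt @ H @ De_inv @ H.T @ Dv_inv_sqrt
    return A @ X @ Theta

# Toy example: 4 nodes, 2 hyperedges.
# Hyperedge 0 joins nodes {0, 1, 2}; hyperedge 1 joins nodes {2, 3}.
H = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1]], dtype=float)
X = np.eye(4)       # one-hot features so the output exposes the mixing pattern
Theta = np.eye(4)
out = hypergraph_conv(X, H, Theta)
```

With one-hot inputs, `out` is exactly the propagation operator: nodes 0 and 2 influence each other through hyperedge 0, while nodes 0 and 3, which share no hyperedge, do not interact in a single step, showing how hyperedge membership, not pairwise adjacency, governs the information flow.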