SHMamba: Structured Hyperbolic State Space Model for Audio-Visual Question Answering

The Audio-Visual Question Answering (AVQA) task holds significant potential for practical applications. Compared with traditional unimodal approaches, the multi-modal input of AVQA makes feature extraction and fusion more challenging. Euclidean space struggles to represent the multi-dimensional relationships in such data effectively, and it is particularly ill-suited as an embedding space for data with a tree-like or hierarchical structure. In addition, although the self-attention mechanism in Transformers captures the dynamic relationships between elements in a sequence well, its limitations in window modeling and its quadratic computational complexity reduce its effectiveness in modeling long sequences. To address these limitations, we propose SHMamba, a Structured Hyperbolic State Space Model that integrates the advantages of hyperbolic geometry and state space models. Specifically, SHMamba leverages the intrinsic properties of hyperbolic space to represent hierarchical structures and complex relationships in audio-visual data, while the state space model captures dynamic changes over time by modeling the entire sequence globally. Furthermore, we introduce an adaptive curvature hyperbolic alignment module and a cross fusion block to enhance the understanding of hierarchical structures and the dynamic exchange of cross-modal information, respectively. Extensive experiments demonstrate that SHMamba outperforms previous methods with fewer parameters and lower computational cost: the number of learnable parameters is reduced by 78.12%, while average performance improves by 2.53%. These experiments show that our method outperforms all current major methods and is better suited to practical application scenarios.
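
The two ingredients named above, hyperbolic embeddings and a state space recurrence, can be illustrated with a minimal sketch. This is not the SHMamba implementation: it assumes the standard Poincaré ball model with curvature -c (here a fixed number rather than a learned, adaptive one) and a scalar, time-invariant linear state space model rather than a selective Mamba block. All function and parameter names (`expmap0`, `poincare_distance`, `ssm_scan`, `a`, `b`, `c_out`) are hypothetical and chosen for illustration only.

```python
import math


def expmap0(v, c):
    """Map a Euclidean (tangent) vector v at the origin into the
    Poincare ball of curvature -c (c > 0). Assumed standard formula."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]


def poincare_distance(x, y, c):
    """Geodesic distance between two points inside the Poincare ball
    of curvature -c. Distances blow up near the boundary, which is what
    leaves room for tree-like hierarchies."""
    sq = lambda u: sum(a * a for a in u)
    diff = [a - b for a, b in zip(x, y)]
    num = 2.0 * c * sq(diff)
    den = (1.0 - c * sq(x)) * (1.0 - c * sq(y))
    return math.acosh(1.0 + num / den) / math.sqrt(c)


def ssm_scan(xs, a, b, c_out):
    """Scalar linear state space recurrence (illustrative, not selective):
        h_t = a * h_{t-1} + b * x_t,   y_t = c_out * h_t
    One pass over the sequence, so the cost is linear in its length,
    in contrast to the quadratic cost of full self-attention."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c_out * h)
    return ys


if __name__ == "__main__":
    # Embed a Euclidean feature into the ball and measure its depth.
    p = expmap0([0.3, 0.4], c=1.0)
    print(poincare_distance([0.0, 0.0], p, c=1.0))
    # Impulse response of the recurrence decays geometrically with a.
    print(ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c_out=1.0))
```

In this convention the distance from the origin to `expmap0(v, c)` equals twice the Euclidean norm of `v`, so a larger tangent norm places a point deeper in the hierarchy. An adaptive-curvature variant would treat `c` as a trainable parameter instead of a constant.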
