We present SAM, a State-space Audio-language Model that integrates an audio encoder with a Mamba-2 backbone. SAM-2.7B achieves 21.1 mAP on AudioSet and 17.6 SPICE on AudioCaps, matching or surpassing 7B transformer-based models despite having fewer parameters. We further provide the first systematic, representation-level analysis of how state-space models (SSMs) interact with audio encoder outputs: (1) joint finetuning of the audio encoder is essential, as shown by accuracy gains and by the adaptation of token representation rank and similarity observed across SSM sizes; (2) despite their linear scaling with sequence length, SSMs benefit more from compact, information-rich audio token representations than from excessively long token sequences; and (3) instruction-following supervision substantially improves reasoning ability, raising MMAU-Sound accuracy from 22.8 to 56.8. Together, these experiments and analyses establish practical design principles for SSMs as strong, scalable backbones for audio-language models.
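The abstract's points about token similarity and compact token sequences can be illustrated with a small sketch. It is not the paper's analysis code: the function names `mean_pairwise_cosine` and `average_pool`, the choice of cosine similarity as the similarity measure, and the use of average pooling to shorten the sequence are all assumptions made for illustration. Audio tokens are represented as plain lists of floats.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(tokens):
    """Average cosine similarity over all distinct token pairs.

    Assumed proxy for redundancy: values near 1 suggest the tokens
    carry largely overlapping information.
    """
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens")
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += cosine(tokens[i], tokens[j])
    return total / (n * (n - 1) / 2)


def average_pool(tokens, stride):
    """Shorten a token sequence by averaging non-overlapping groups of `stride` tokens.

    The last group may hold fewer than `stride` tokens. This is one
    hypothetical way to obtain a more compact sequence, not necessarily
    the compression used in the paper.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    pooled = []
    for start in range(0, len(tokens), stride):
        group = tokens[start:start + stride]
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled


if __name__ == "__main__":
    # Toy sequence: neighbouring tokens are exact duplicates.
    tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    print("similarity before pooling:", mean_pairwise_cosine(tokens))
    compact = average_pool(tokens, stride=2)
    print("tokens after pooling:", len(compact))
    print("similarity after pooling:", mean_pairwise_cosine(compact))
```

In this toy case, pooling halves the sequence length while removing the duplicated tokens, which is the kind of compact, less redundant representation that point (2) describes. Real encoder outputs would need to be extracted from the model and converted to lists or arrays first.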