Effective multimodal fusion requires mechanisms that capture complex cross-modal dependencies while remaining computationally scalable for real-world deployment. Existing audio-visual fusion approaches face a fundamental trade-off: attention-based methods model cross-modal relationships effectively but incur quadratic computational complexity that makes hierarchical, multi-scale architectures impractical, while efficient fusion strategies rely on simple concatenation that fails to extract complementary cross-modal information. We introduce CMQKA (Cross-Modal Query-Key Attention), a novel cross-modal fusion mechanism that achieves linear O(N) complexity through efficient binary operations, enabling scalable hierarchical fusion previously infeasible with conventional attention. CMQKA employs bidirectional cross-modal query-key attention to extract complementary spatiotemporal features and uses learnable residual fusion to preserve modality-specific characteristics while enriching representations with cross-modal information. Building on CMQKA, we present SNNergy, an energy-efficient multimodal fusion framework with a hierarchical architecture that processes inputs at progressively decreasing spatial resolutions and increasing levels of semantic abstraction. This multi-scale fusion capability allows the framework to capture both local patterns and global context across modalities. Implemented with event-driven binary spike operations, SNNergy achieves high energy efficiency while maintaining fusion effectiveness, establishing new state-of-the-art results on challenging audio-visual benchmarks, including CREMA-D, AVE, and UrbanSound8K-AV, and significantly outperforming existing multimodal fusion baselines. Our framework advances multimodal fusion by introducing a scalable mechanism that enables hierarchical cross-modal integration with practical energy efficiency for real-world audio-visual intelligence systems.
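To make the linear-complexity claim concrete, the sketch below shows one plausible way such a mechanism could be realized in PyTorch. It is not the authors' implementation: the module name CrossModalQKA, the hard-threshold spike function, and the kernel-style linearization (computing K^T V before applying Q, which replaces the O(N^2 d) attention matrix with an O(N d^2) contraction) are assumptions filled in from the abstract's description of binary spike operations, bidirectional query-key attention, and learnable residual fusion.

```python
# Minimal sketch, assuming a kernel-style linearization of bidirectional
# cross-modal query-key attention with binary (0/1) spike activations.
# All names (CrossModalQKA, d_model, alpha_a, alpha_v) are illustrative,
# not taken from the paper.
import torch
import torch.nn as nn


class CrossModalQKA(nn.Module):
    """Bidirectional cross-modal fusion with linear-time attention.

    Complexity is O(N * d^2) rather than O(N^2 * d) because K^T V is
    contracted first into a (d x d) context matrix. With binary spike
    activations, the products reduce to masked accumulations.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.q_a = nn.Linear(d_model, d_model, bias=False)  # audio queries
        self.k_v = nn.Linear(d_model, d_model, bias=False)  # visual keys
        self.q_v = nn.Linear(d_model, d_model, bias=False)  # visual queries
        self.k_a = nn.Linear(d_model, d_model, bias=False)  # audio keys
        # Learnable residual-fusion gates, one scalar per direction;
        # initialized to zero so fusion starts as the identity mapping.
        self.alpha_a = nn.Parameter(torch.zeros(1))
        self.alpha_v = nn.Parameter(torch.zeros(1))

    @staticmethod
    def _spike(x: torch.Tensor) -> torch.Tensor:
        # Hard 0/1 threshold standing in for an event-driven spike neuron.
        # A trainable SNN would pair this with a surrogate gradient.
        return (x > 0).float()

    @staticmethod
    def _linear_attn(q: torch.Tensor, k: torch.Tensor,
                     v: torch.Tensor) -> torch.Tensor:
        # q, k, v: (B, N, d). Contract k^T v first -> (B, d, d),
        # then apply q; both steps are linear in N.
        context = torch.einsum('bnd,bne->bde', k, v)
        return torch.einsum('bnd,bde->bne', q, context) / k.shape[1]

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio, visual: (B, N, d) token sequences from each encoder stage.
        qa, kv = self._spike(self.q_a(audio)), self._spike(self.k_v(visual))
        qv, ka = self._spike(self.q_v(visual)), self._spike(self.k_a(audio))
        a2v = self._linear_attn(qa, kv, visual)  # visual context for audio
        v2a = self._linear_attn(qv, ka, audio)   # audio context for visual
        # Learnable residual fusion: preserve modality-specific features,
        # enrich them with cross-modal context.
        return audio + self.alpha_a * a2v, visual + self.alpha_v * v2a


# Example usage with illustrative shapes.
fuser = CrossModalQKA(d_model=64)
a = torch.randn(2, 196, 64)  # audio tokens
v = torch.randn(2, 196, 64)  # visual tokens
a_fused, v_fused = fuser(a, v)
```

Under these assumptions, stacking such a block at each stage of a spatial pyramid yields the hierarchical, multi-scale fusion the abstract describes, since the per-stage cost stays linear in the token count.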