The Transformer model, particularly its cross-attention module, is widely used for feature fusion in target sound extraction, which extracts the signal of interest based on given clues. Despite its effectiveness, this approach suffers from low computational efficiency. Recent advancements in state space models, notably the latest work Mamba, have shown performance comparable to Transformer-based methods while significantly reducing computational complexity across various tasks. However, Mamba's applicability to target sound extraction is limited because, unlike cross-attention, it cannot capture dependencies between different sequences. In this paper, we propose CrossMamba for target sound extraction, which leverages the hidden attention mechanism of Mamba to compute dependencies between the given clues and the audio mixture. The calculation of Mamba can be divided into query, key, and value components. We use the clue to generate the query and the audio mixture to derive the key and value, following the principle of the cross-attention mechanism in Transformers. Experimental results on two representative target sound extraction methods validate the efficacy of the proposed CrossMamba.
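The query/key/value decomposition mentioned above comes from the hidden-attention view of Mamba's selective state space model: the output projection C acts like a query, the input projection B like a key, and the input features like values. A minimal numpy sketch of this idea, with C computed from the clue and B and the values from the mixture, is given below. All weight names and shapes here are illustrative assumptions with a simple diagonal state matrix, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_model, T = 8, 16, 10  # state size, feature size, sequence length

# Hypothetical projection weights (random, for illustration only).
W_C = rng.standard_normal((d_model, d_state))  # query-like projection
W_B = rng.standard_normal((d_model, d_state))  # key-like projection
W_dt = rng.standard_normal(d_model)            # step-size projection

clue = rng.standard_normal((T, d_model))  # clue features, aligned to T steps
mix = rng.standard_normal((T, d_model))   # audio-mixture features

# CrossMamba-style parameterisation:
# C (query) from the clue; B (key) and x (value) from the mixture.
C = clue @ W_C                      # (T, d_state)
B = mix @ W_B                       # (T, d_state)
x = mix                             # values are the mixture features
dt = np.log1p(np.exp(mix @ W_dt))   # softplus -> positive step sizes, (T,)
A = -np.ones(d_state)               # simple stable diagonal state matrix

# Selective scan: h_t = exp(dt_t * A) ⊙ h_{t-1} + dt_t * B_t x_t^T,  y_t = C_t h_t
h = np.zeros((d_state, d_model))
y = np.zeros((T, d_model))
for t in range(T):
    decay = np.exp(dt[t] * A)                      # (d_state,) discretised decay
    h = decay[:, None] * h + dt[t] * np.outer(B[t], x[t])
    y[t] = C[t] @ h
```

Unrolling the recurrence gives y_t = Σ_{s≤t} α_{t,s} x_s with α_{t,s} = C_t · (Π_{k=s+1..t} exp(dt_k A)) ⊙ B_s · dt_s, which is exactly a (causal) attention map between clue-derived queries and mixture-derived keys.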