In this work, we extend our previously proposed offline SpatialNet to long-term streaming multichannel speech enhancement in both static and moving speaker scenarios. SpatialNet exploits spatial information, such as the spatial/steering direction of speech, to discriminate between target speech and interference, and achieved outstanding performance. The core of SpatialNet is a narrow-band self-attention module used for learning the temporal dynamics of spatial vectors. For long-term streaming speech enhancement, we propose to replace the offline self-attention network with online networks whose inference complexity is linear with respect to signal length and which retain the capability of learning long-term information. Three variants are developed, based on (i) masked self-attention, (ii) Retention, a self-attention variant with linear inference complexity, and (iii) Mamba, a structured-state-space-based RNN-like network. Moreover, we investigate the length extrapolation ability of the different networks, that is, their performance when tested on signals much longer than the training signals, and propose a short-signal training plus long-signal fine-tuning strategy, which largely improves the length extrapolation ability of the networks within limited training time. Overall, the proposed online SpatialNet achieves outstanding speech enhancement performance for long audio streams, for both static and moving speakers. The proposed method will be open-sourced at https://github.com/Audio-WestlakeU/NBSS.
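To illustrate why a Retention-style layer can run with linear inference complexity in signal length, the following minimal sketch compares a parallel (quadratic) form of single-head retention with an equivalent recurrent form that updates a fixed-size state once per frame. It is not the authors' implementation: the function names, the decay value `gamma=0.9`, and the tensor sizes are assumptions chosen for illustration, and components commonly used with Retention (position rotation, multi-head processing, normalization, gating) are omitted.

```python
import numpy as np

def retention_parallel(q, k, v, gamma):
    """Parallel form: builds a T x T causal decay matrix (quadratic in T)."""
    T = q.shape[0]
    n = np.arange(T)
    diff = n[:, None] - n[None, :]
    # decay[i, j] = gamma**(i - j) for past/current frames, 0 for future frames
    decay = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)
    return ((q @ k.T) * decay) @ v

def retention_recurrent(q, k, v, gamma):
    """Recurrent form: one fixed-size state update per frame (linear in T)."""
    d_k, d_v = k.shape[1], v.shape[1]
    state = np.zeros((d_k, d_v))          # memory does not grow with T
    out = np.empty((q.shape[0], d_v))
    for t in range(q.shape[0]):
        # decay the old state, then add the current key/value outer product
        state = gamma * state + np.outer(k[t], v[t])
        out[t] = q[t] @ state
    return out

# Hypothetical sizes: T frames, feature dimension d (illustrative only)
rng = np.random.default_rng(0)
T, d = 50, 8
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))

y_par = retention_parallel(q, k, v, gamma=0.9)
y_rec = retention_recurrent(q, k, v, gamma=0.9)
print(np.allclose(y_par, y_rec))
```

The recurrent form only needs the previous state and the current frame, which is what makes it suitable for streaming: per-frame cost and memory stay constant however long the audio stream runs.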