In this work, we extend our previously proposed offline SpatialNet to long-term streaming multichannel speech enhancement in both static and moving speaker scenarios. SpatialNet exploits spatial information, such as the spatial/steering direction of speech, to discriminate between target speech and interferences, and has achieved outstanding performance. The core of SpatialNet is a narrow-band self-attention module used for learning the temporal dynamics of spatial vectors. Towards long-term streaming speech enhancement, we propose to replace the offline self-attention network with online networks that have linear inference complexity w.r.t. signal length while maintaining the capability of learning long-term information. Three variants are developed based on (i) masked self-attention, (ii) Retention, a self-attention variant with linear inference complexity, and (iii) Mamba, a structured-state-space-based RNN-like network. Moreover, we investigate the length extrapolation ability of different networks, i.e., testing on signals that are much longer than the training signals, and propose a short-signal training plus long-signal fine-tuning strategy, which largely improves the length extrapolation ability of the networks within limited training time. Overall, the proposed online SpatialNet achieves outstanding speech enhancement performance for long audio streams, and for both static and moving speakers. The proposed method is open-sourced at https://github.com/Audio-WestlakeU/NBSS.