This work proposes SpatialNet, a neural network that extensively exploits spatial information for multichannel joint speech separation, denoising and dereverberation. The proposed network performs end-to-end speech enhancement in the short-time Fourier transform (STFT) domain. It is mainly composed of interleaved narrow-band and cross-band blocks, which exploit narrow-band and cross-band spatial information, respectively. The narrow-band blocks process frequencies independently, using a self-attention mechanism to perform spatial-feature-based speaker clustering and temporal convolutional layers to perform temporal smoothing/filtering. The cross-band blocks process frames independently, using a full-band linear layer to learn the correlation between all frequencies and frequency convolutional layers to learn the correlation between adjacent frequencies. Experiments are conducted on various simulated and real datasets, and the results show that 1) the proposed network achieves state-of-the-art performance on almost all tasks; 2) the proposed network suffers little from the spectral generalization problem; and 3) the proposed network does indeed perform speaker clustering, as demonstrated by its attention maps.
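The division of labour between the two block types can be illustrated with a small NumPy sketch. This is not the authors' implementation: the tensor layout `[F, T, H]` (frequencies, frames, hidden features), the single-head attention, the fixed three-tap smoothing kernels, the residual connections, and all function and variable names are assumptions chosen for illustration. The sketch only shows which axis each block mixes: the narrow-band block mixes frames within one frequency, and the cross-band block mixes frequencies within one frame.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def conv_same(x, kernel, axis):
    # Depthwise 1-D convolution with zero "same" padding along one axis.
    # Assumes an odd-length kernel (hypothetical choice for this sketch).
    pad = len(kernel) // 2
    widths = [(0, 0)] * x.ndim
    widths[axis] = (pad, pad)
    xp = np.pad(x, widths)
    n = x.shape[axis]
    return sum(k * np.take(xp, np.arange(i, i + n), axis=axis)
               for i, k in enumerate(kernel))

def narrow_band_block(x, wq, wk, wv, t_kernel):
    # x: [F, T, H]. Each frequency is processed independently,
    # with weights shared across frequencies.
    q, k, v = x @ wq, x @ wk, x @ wv                      # [F, T, H] each
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    attn = softmax(scores, axis=-1)                       # [F, T, T], over frames
    x = x + attn @ v                                      # self-attention + residual
    x = x + conv_same(x, t_kernel, axis=1)                # temporal convolution
    return x, attn

def cross_band_block(x, w_full, f_kernel):
    # x: [F, T, H]. Each frame is processed independently.
    x = x + conv_same(x, f_kernel, axis=0)                # adjacent frequencies
    x = x + np.einsum('gf,fth->gth', w_full, x)           # all frequencies
    return x

F, T, H = 5, 7, 4                                         # toy sizes (assumed)
x = rng.standard_normal((F, T, H))
wq, wk, wv = (rng.standard_normal((H, H)) * 0.1 for _ in range(3))
w_full = rng.standard_normal((F, F)) * 0.1
t_kernel = np.array([0.25, 0.5, 0.25])
f_kernel = np.array([0.25, 0.5, 0.25])

y_nb, attn = narrow_band_block(x, wq, wk, wv, t_kernel)
y = cross_band_block(y_nb, w_full, f_kernel)
print(y.shape, attn.shape)  # (5, 7, 4) (5, 7, 7)
```

In this sketch, perturbing the input at one frequency changes the narrow-band output only at that frequency, and perturbing one frame changes the cross-band output only at that frame. Interleaving the two block types is what lets information propagate across both axes.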