This work proposes SpatialNet, a neural network that extensively exploits spatial information for multichannel joint speech separation, denoising, and dereverberation. The network performs end-to-end speech enhancement in the short-time Fourier transform (STFT) domain. It is mainly composed of interleaved narrow-band and cross-band blocks, which exploit narrow-band and cross-band spatial information, respectively. The narrow-band blocks process each frequency independently, using a self-attention mechanism to perform spatial-feature-based speaker clustering and temporal convolutional layers to perform temporal smoothing/filtering. The cross-band blocks process each frame independently, using a full-band linear layer to learn the correlation across all frequencies and frequency convolutional layers to learn the correlation between adjacent frequencies. Experiments on various simulated and real datasets show that 1) the proposed network achieves state-of-the-art performance on almost all tasks; 2) it suffers little from the spectral generalization problem; and 3) it does perform speaker clustering, as demonstrated by its attention maps.
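
The interleaving of the two block types can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names, the toy projection-free attention, the single shared `weights` matrix, and the tensor layout `x[f][t]` (one feature vector per frequency and frame) are all assumptions made for illustration, and the temporal and frequency convolutional layers are omitted. The sketch only shows which axis each block mixes: narrow-band blocks mix along time within one frequency, while cross-band blocks mix along frequency within one frame.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Toy single-head self-attention over a list of feature vectors.

    Assumption: no learned query/key/value projections, for brevity.
    """
    dim = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in seq]
        w = softmax(scores)
        out.append([sum(w[j] * seq[j][d] for j in range(len(seq))) for d in range(dim)])
    return out


def narrow_band_block(x):
    """Process every frequency independently, attending along the time axis."""
    out = []
    for band in x:  # band: the sequence of frames for one frequency
        attended = self_attention(band)
        # Residual connection around the attention step.
        out.append([[a + r for a, r in zip(h, v)] for h, v in zip(attended, band)])
    return out


def cross_band_block(x, weights):
    """Process every frame independently with a full-band linear layer.

    weights[f_out][f_in] mixes all frequencies; hypothetical F x F matrix.
    """
    n_freq, n_frames, dim = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * dim for _ in range(n_frames)] for _ in range(n_freq)]
    for t in range(n_frames):
        for f_out in range(n_freq):
            for d in range(dim):
                mixed = sum(weights[f_out][f_in] * x[f_in][t][d] for f_in in range(n_freq))
                out[f_out][t][d] = x[f_out][t][d] + mixed  # residual connection
    return out


def spatialnet_sketch(x, weights, num_blocks=2):
    """Interleave narrow-band and cross-band blocks (weights shared for brevity)."""
    for _ in range(num_blocks):
        x = narrow_band_block(x)
        x = cross_band_block(x, weights)
    return x


random.seed(0)
F, T, D = 4, 5, 3  # toy sizes: frequencies, frames, feature dimension
x = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(T)] for _ in range(F)]
weights = [[random.gauss(0, 0.1) for _ in range(F)] for _ in range(F)]
y = spatialnet_sketch(x, weights)
print(len(y), len(y[0]), len(y[0][0]))  # shape is preserved: 4 5 3
```

Because each block mixes along only one axis, changing the input at one frequency leaves the narrow-band output at every other frequency unchanged, and changing one frame leaves the cross-band output at every other frame unchanged. Information spreads across the whole time-frequency plane only through the interleaving.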