We introduce CrossNet, a complex spectral mapping approach to speaker separation and enhancement in reverberant and noisy conditions. The proposed architecture comprises an encoder layer, a global multi-head self-attention module, a cross-band module, a narrow-band module, and an output layer, which together capture global, cross-band, and narrow-band correlations in the time-frequency domain. To address performance degradation on long utterances, we introduce a random chunk positional encoding. Experimental results on multiple datasets demonstrate the effectiveness and robustness of CrossNet, which achieves state-of-the-art performance on tasks including reverberant and noisy-reverberant speaker separation. CrossNet also trains faster and more stably than recent baselines, and its high performance extends to multi-microphone conditions, demonstrating its versatility across acoustic scenarios.
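The abstract names random chunk positional encoding without spelling out its mechanism. One plausible reading, offered here only as a sketch under stated assumptions, is that training draws a contiguous chunk of positional embeddings from a random offset in a table longer than any training utterance, so that embeddings for late positions are also seen during training, while inference reads from offset 0. The sinusoidal table, the function names `sinusoidal_table` and `random_chunk_pe`, and the choice of offset 0 at inference are all illustrative assumptions, not details taken from the text above.

```python
import math
import random


def sinusoidal_table(max_len, dim):
    """Build a sinusoidal positional-embedding table (assumed embedding type).

    Returns a list of max_len rows, each with dim values (dim must be even).
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(max_len):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000.0 ** (i / dim))
            row.extend((math.sin(angle), math.cos(angle)))
        table.append(row)
    return table


def random_chunk_pe(table, num_frames, training, rng=random):
    """Select num_frames contiguous embeddings from the table.

    Training: the chunk starts at a random offset (hypothetical reading of
    "random chunk"). Inference: the chunk starts at offset 0 (assumption).
    """
    if num_frames > len(table):
        raise ValueError("utterance is longer than the embedding table")
    # randint is inclusive on both ends, so the chunk always fits in the table.
    start = rng.randint(0, len(table) - num_frames) if training else 0
    return table[start:start + num_frames]


# Usage: a table of 2000 positions serving a 300-frame training utterance.
table = sinusoidal_table(max_len=2000, dim=8)
train_pe = random_chunk_pe(table, 300, training=True, rng=random.Random(0))
eval_pe = random_chunk_pe(table, 300, training=False)
```

Under this reading, the table length bounds the longest utterance that can be encoded at inference, which is why it is chosen well above the training length.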