Our previously proposed MossFormer has achieved promising performance in monaural speech separation. However, it predominantly adopts a self-attention-based MossFormer module, which tends to emphasize longer-range, coarser-scale dependencies and falls short in effectively modelling finer-scale recurrent patterns. In this paper, we introduce a novel hybrid model that captures both long-range, coarse-scale dependencies and fine-scale recurrent patterns by integrating a recurrent module into the MossFormer framework. Instead of applying recurrent neural networks (RNNs) with traditional recurrent connections, we present a recurrent module based on a feedforward sequential memory network (FSMN), which is considered an "RNN-free" recurrent network because it captures recurrent patterns without using recurrent connections. Our recurrent module mainly comprises a dilated FSMN block enhanced with gated convolutional units (GCUs) and dense connections. In addition, a bottleneck layer and an output layer are added to control information flow. The recurrent module relies only on linear projections and convolutions, enabling seamless, parallel processing of the entire sequence. The resulting MossFormer2 hybrid model demonstrates remarkable improvements over MossFormer and surpasses other state-of-the-art methods on the WSJ0-2/3mix, Libri2Mix, and WHAM!/WHAMR! benchmarks (https://github.com/modelscope/ClearerVoice-Studio).
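To illustrate the "RNN-free" idea behind the FSMN-based recurrent module, below is a minimal numpy sketch of an FSMN-style memory block with a dilation factor. All names, shapes, and the function signature are our own illustrative assumptions, not the paper's actual implementation; it shows only the core mechanism: each frame adds a learned weighted sum of past frames via a causal dilated filter, so recurrent patterns are captured without recurrent connections and the whole sequence can be processed in parallel.

```python
import numpy as np

def fsmn_memory_block(x, filters, dilation=1):
    """Illustrative FSMN-style memory block (a sketch, not the paper's layer).

    x:       (T, D) array, a sequence of T frames with D features
    filters: (K, D) per-feature taps over K past positions (tap 0 = current frame)

    Frame t receives contributions from frames t - k*dilation (k = 0..K-1),
    computed as a causal dilated convolution -- no recurrent connections.
    """
    T, D = x.shape
    K = filters.shape[0]
    mem = np.zeros_like(x)
    for k in range(K):
        shift = k * dilation
        if shift < T:
            # frame t attends to frame t - k*dilation (causal "memory")
            mem[shift:] += filters[k] * x[:T - shift]
    return x + mem  # residual connection, as is typical for FSMN blocks
```

With zero filters the block reduces to the identity; with a nonzero tap at position k, each output frame mixes in the frame k*dilation steps earlier. In practice the taps would be learned, and the loop over K would be replaced by a (depthwise) convolution for efficiency.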