Our previously proposed MossFormer has achieved promising performance in monaural speech separation. However, it relies predominantly on a self-attention-based MossFormer module, which tends to emphasize longer-range, coarser-scale dependencies and is less effective at modelling finer-scale recurrent patterns. In this paper, we introduce a novel hybrid model that can capture both long-range, coarse-scale dependencies and fine-scale recurrent patterns by integrating a recurrent module into the MossFormer framework. Instead of applying recurrent neural networks (RNNs) that use traditional recurrent connections, we present a recurrent module based on a feedforward sequential memory network (FSMN), which is considered an "RNN-free" recurrent network because it captures recurrent patterns without using recurrent connections. Our recurrent module mainly comprises a dilated FSMN block enhanced with gated convolutional units (GCU) and dense connections. In addition, a bottleneck layer and an output layer are added to control information flow. The recurrent module relies on linear projections and convolutions, so the entire sequence can be processed seamlessly and in parallel. The integrated MossFormer2 hybrid model demonstrates remarkable improvements over MossFormer and surpasses other state-of-the-art methods on the WSJ0-2/3mix, Libri2Mix, and WHAM!/WHAMR! benchmarks.
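To make the "RNN-free" idea concrete, the following is a minimal sketch of a dilated feedforward memory applied to a single channel, followed by a simple gate. It is an illustration under stated assumptions, not the MossFormer2 implementation: the function names, the causal (look-back only) taps, the scalar single-channel input, and the particular gating form are all hypothetical choices made for brevity, and the bottleneck layer, output layer, and dense connections are omitted.

```python
def dilated_fsmn_memory(h, taps, dilation):
    """Feedforward memory: m[t] = h[t] + sum_i taps[i] * h[t - i * dilation].

    h is one channel over time (a list of floats). Positions before the
    start of the sequence are treated as zero. No state is carried from one
    step to the next, so every m[t] could be computed independently.
    """
    out = []
    for t in range(len(h)):
        acc = h[t]  # skip connection keeps the current frame
        for i, a in enumerate(taps):
            src = t - i * dilation  # look back i * dilation frames
            if src >= 0:
                acc += a * h[src]
        out.append(acc)
    return out


def gated_unit(x, gate, taps, dilation):
    """Hypothetical gating: input plus the memory output scaled by a gate."""
    mem = dilated_fsmn_memory(x, taps, dilation)
    return [xi + mi * gi for xi, mi, gi in zip(x, mem, gate)]


# Impulse response: the memory behaves like a fixed-length dilated filter.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
response = dilated_fsmn_memory(impulse, taps=[0.5, 0.25], dilation=2)
print(response)  # [1.5, 0.0, 0.25, 0.0, 0.0]
```

Because each output frame depends only on input frames at fixed offsets, the loop over `t` has no sequential dependency, which is what allows this kind of module to be expressed with convolutions and run in parallel over the whole sequence.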