Transformer-based models have provided significant performance improvements in monaural speech separation. However, there is still a performance gap compared to a recently proposed upper bound. The major limitation of current dual-path Transformer models is the inefficient modelling of long-range elemental interactions and local feature patterns. In this work, we achieve the upper bound by proposing a gated single-head Transformer architecture with convolution-augmented joint self-attentions, named \textit{MossFormer} (\textit{Mo}naural \textit{s}peech \textit{s}eparation Trans\textit{Former}). To address the indirect elemental interactions across chunks in the dual-path architecture, MossFormer employs a joint local and global self-attention architecture that simultaneously performs a full-computation self-attention on local chunks and a linearised low-cost self-attention over the full sequence. The joint attention enables MossFormer to model full-sequence elemental interactions directly. In addition, we employ a powerful attentive gating mechanism with simplified single-head self-attentions. Besides the attentive long-range modelling, we also augment MossFormer with convolutions for position-wise local pattern modelling. As a consequence, MossFormer significantly outperforms the previous models and achieves state-of-the-art results on the WSJ0-2/3mix and WHAM!/WHAMR! benchmarks. Our model achieves the SI-SDRi upper bound of 21.2 dB on WSJ0-3mix and is only 0.3 dB below the upper bound of 23.1 dB on WSJ0-2mix.
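The joint local and global self-attention described above can be illustrated with a minimal NumPy sketch. This is not the MossFormer implementation: the function name `joint_attention`, the softmax used for the local branch, the `1/T` scaling of the linearised branch, and the plain summation of the two branches are all assumptions made for illustration, and the gating and convolution modules are omitted.

```python
import numpy as np

def softmax(a, axis=-1):
    # Numerically stable softmax along the given axis.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(q, k, v, chunk_size):
    """Single-head joint local + global attention (illustrative only).

    q, k, v: arrays of shape (T, d); T must be divisible by chunk_size.
    """
    T, d = q.shape
    assert T % chunk_size == 0, "sequence length must be a multiple of chunk_size"
    n = T // chunk_size

    # Local branch: full (quadratic) attention inside each chunk.
    qc = q.reshape(n, chunk_size, d)
    kc = k.reshape(n, chunk_size, d)
    vc = v.reshape(n, chunk_size, d)
    scores = qc @ kc.transpose(0, 2, 1) / np.sqrt(d)   # (n, C, C)
    local = (softmax(scores) @ vc).reshape(T, d)

    # Global branch: linearised attention over the full sequence.
    # Computing K^T V first costs O(T d^2) instead of O(T^2 d).
    kv = k.T @ v / T                                   # (d, d)
    global_ = q @ kv                                   # (T, d)

    # Assumed combination: simple sum of the two branches.
    return local + global_

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
out = joint_attention(q, k, v, chunk_size=4)
print(out.shape)
```

In this sketch every position attends to all others through the global branch in a single layer, whereas the local branch only sees positions within its own chunk; this is the sense in which cross-chunk interaction becomes direct rather than passing through separate intra-chunk and inter-chunk stages.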