Transformers have enabled impressive improvements in deep learning. They outperform recurrent and convolutional models in many tasks while taking advantage of parallel processing. Recently, we proposed the SepFormer, which obtains state-of-the-art performance in speech separation on the WSJ0-2/3 Mix datasets. This paper studies Transformers for speech separation in depth. In particular, we extend our previous findings on the SepFormer by providing results on more challenging noisy and noisy-reverberant datasets, such as LibriMix, WHAM!, and WHAMR!. Moreover, we extend our model to perform speech enhancement and provide experimental evidence on denoising and dereverberation tasks. Finally, we investigate, for the first time in speech separation, the use of efficient self-attention mechanisms such as the Linformer, Longformer, and Reformer. We find that they significantly reduce memory requirements. For example, we show that the Reformer-based attention outperforms the popular Conv-TasNet model on the WSJ0-2Mix dataset while being faster at inference and comparable in terms of memory consumption.