Recent high-performance transformer-based speech enhancement models demonstrate that time-domain methods can achieve performance comparable to that of time-frequency-domain methods. However, time-domain speech enhancement systems typically receive input audio sequences spanning a very large number of time steps, which makes such extremely long sequences difficult to model and the models difficult to train well. In this paper, we address these challenges by using smaller audio chunks as input, which makes more efficient use of the audio information. We propose the dual-phase audio transformer for denoising (DPATD), a novel model that organizes transformer layers in a deep structure to learn clean audio sequences for denoising. DPATD splits the audio input into smaller chunks, so that the input length can be proportional to the square root of the original sequence length. Our memory-compressed explainable attention is more efficient and converges faster than the commonly used self-attention module. Extensive experiments demonstrate that our model outperforms state-of-the-art methods.
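The chunking idea can be illustrated with a minimal sketch. This is not the DPATD implementation: the function name `split_into_chunks`, the non-overlapping layout, the zero padding, and the specific choice of chunk length as the ceiling of the square root of the sequence length are all assumptions made for illustration. The abstract only states that the input length can be proportional to the square root of the original length.

```python
import math

import numpy as np


def split_into_chunks(signal, chunk_len=None):
    """Split a 1-D signal into equal-length, non-overlapping chunks.

    Hypothetical helper for illustration only. If chunk_len is not given,
    it defaults to ceil(sqrt(len(signal))) (an assumed choice), so both the
    chunk length and the number of chunks grow roughly as sqrt(L).
    """
    signal = np.asarray(signal, dtype=np.float32)
    n = signal.shape[0]
    if chunk_len is None:
        chunk_len = max(1, math.ceil(math.sqrt(n)))
    # Number of chunks needed to cover the signal (ceiling division).
    num_chunks = -(-n // chunk_len)
    # Zero-pad the tail so the signal fills the last chunk exactly.
    pad = num_chunks * chunk_len - n
    padded = np.pad(signal, (0, pad))
    return padded.reshape(num_chunks, chunk_len)


if __name__ == "__main__":
    # One second of audio at 16 kHz (random samples stand in for speech).
    audio = np.random.randn(16000)
    chunks = split_into_chunks(audio)
    print(chunks.shape)  # (126, 127)
```

Under these assumptions, a 16,000-sample input becomes 126 chunks of 127 samples each, so a model that attends within a chunk and then across chunks handles sequences of length about 127 rather than 16,000 in either pass.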