Achieving superior enhancement performance while maintaining a low parameter count and low computational complexity remains a challenge in the field of speech enhancement. In this paper, we introduce LORT, a novel architecture that integrates a spatial-channel enhanced Taylor Transformer with locally refined convolution for efficient and robust speech enhancement. We propose a Taylor multi-head self-attention (T-MSA) module enhanced with spatial-channel enhancement attention (SCEA), designed to facilitate inter-channel information exchange and alleviate the spatial attention limitations inherent in Taylor-based Transformers. To complement global modeling, we further present a locally refined convolution (LRC) block that integrates convolutional feed-forward layers, time-frequency dense local convolutions, and gated units to capture fine-grained local details. Built upon a U-Net-like encoder-decoder structure with only 16 output channels in the encoder, LORT processes noisy inputs through multi-resolution T-MSA modules via alternating downsampling and upsampling operations. The enhanced magnitude and phase spectra are decoded independently and optimized with a composite loss function that jointly considers magnitude, complex, phase, discriminator, and consistency objectives. Experimental results on the VCTK+DEMAND and DNS Challenge datasets demonstrate that LORT achieves performance competitive with or superior to state-of-the-art (SOTA) models with only 0.96M parameters, highlighting its effectiveness for real-world speech enhancement applications with limited computational resources.
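To make the core mechanism concrete, the sketch below illustrates the general first-order Taylor approximation of softmax attention on which Taylor-based Transformers build: exp(q·k) ≈ 1 + q·k, which allows the attention product to be reassociated as Q(KᵀV) for linear complexity in sequence length. The normalization choices and the SCEA branch of LORT's T-MSA are not shown; this is an illustrative sketch under common assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def taylor_attention(q, k, v, eps=1e-6):
    """First-order Taylor-expanded self-attention (illustrative sketch).

    Shapes: q, k, v are (batch, heads, length, dim).
    exp(q.k) is approximated by 1 + q.k, so (QK^T)V can be computed as
    Q(K^T V), avoiding the quadratic attention matrix.
    """
    # Normalize q and k so the first-order term 1 + q.k stays non-negative.
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)

    b, h, n, d = q.shape
    # Numerator: sum_j v_j + q_i (sum_j k_j v_j^T)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)                      # (b, h, d, d_v)
    num = v.sum(dim=2, keepdim=True) + torch.einsum("bhnd,bhde->bhne", q, kv)
    # Denominator: N + q_i . (sum_j k_j)
    k_sum = k.sum(dim=2)                                            # (b, h, d)
    den = n + torch.einsum("bhnd,bhd->bhn", q, k_sum)
    return num / (den.unsqueeze(-1) + eps)
```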
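The composite training objective enumerated above can likewise be sketched as a weighted sum of magnitude, complex, phase, discriminator, and consistency terms. The individual loss forms and weights below are common choices assumed for illustration, not the paper's exact definitions.

```python
import torch
import torch.nn.functional as F

def composite_loss(mag_est, mag_ref, cplx_est, cplx_ref,
                   pha_est, pha_ref, disc_score,
                   weights=(0.9, 0.1, 0.3, 0.05, 0.1)):
    """Illustrative composite objective: magnitude + complex + phase
    + discriminator + consistency. Weights and loss forms are assumptions.
    cplx_* are assumed to stack real/imaginary parts in the last dimension.
    """
    w_mag, w_cplx, w_pha, w_disc, w_con = weights

    # Magnitude loss: MSE on magnitude spectra.
    loss_mag = F.mse_loss(mag_est, mag_ref)

    # Complex loss: MSE on real/imaginary spectra.
    loss_cplx = F.mse_loss(cplx_est, cplx_ref)

    # Phase loss: anti-wrapping distance on phase spectra (assumed form).
    def anti_wrap(x):
        return torch.abs(x - 2 * torch.pi * torch.round(x / (2 * torch.pi)))
    loss_pha = anti_wrap(pha_est - pha_ref).mean()

    # Discriminator (metric) loss: push the discriminator score toward 1.
    loss_disc = F.mse_loss(disc_score, torch.ones_like(disc_score))

    # Consistency loss: penalize mismatch between the complex spectrum and
    # the one reconstructed from the separately decoded magnitude and phase.
    recon = torch.stack((mag_est * torch.cos(pha_est),
                         mag_est * torch.sin(pha_est)), dim=-1)
    loss_con = F.mse_loss(recon, cplx_est)

    return (w_mag * loss_mag + w_cplx * loss_cplx + w_pha * loss_pha
            + w_disc * loss_disc + w_con * loss_con)
```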