In this study, we propose a dense frequency-time attentive network (DeFT-AN) for multichannel speech enhancement. DeFT-AN is a mask estimation network that predicts a complex spectral mask to suppress the noise and reverberation embedded in the short-time Fourier transform (STFT) of an input signal. The network incorporates three types of blocks that aggregate information along the spatial, spectral, and temporal dimensions, respectively. It uses a spectral transformer with a modified feed-forward network and a temporal conformer with sequential dilated convolutions. Dedicating dense blocks and transformers to these three characteristics of audio signals enables more comprehensive enhancement in noisy and reverberant environments. Experiments on two popular noisy and reverberant datasets show that DeFT-AN outperforms state-of-the-art multichannel models on various metrics of speech quality and intelligibility.
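To make the idea of complex spectral masking concrete, the following is a minimal sketch of how a predicted complex mask could be applied to a multichannel STFT. It is not the DeFT-AN implementation: the function name `apply_complex_mask`, the choice of a reference channel, the tensor shapes, and the constant placeholder mask are all assumptions made for illustration, and the mask here stands in for the output of a trained network.

```python
import numpy as np

def apply_complex_mask(noisy_stft, mask, ref_channel=0):
    """Multiply one channel of a multichannel STFT by a complex mask.

    noisy_stft:  complex array, shape (channels, freq_bins, frames)
    mask:        complex array, shape (freq_bins, frames)
    ref_channel: channel the mask is applied to (an assumption of this sketch)
    """
    # Complex multiplication scales the magnitude and rotates the phase
    # of every time-frequency bin independently.
    return mask * noisy_stft[ref_channel]

# Synthetic stand-in for the STFT of a 4-microphone recording.
rng = np.random.default_rng(0)
channels, freq_bins, frames = 4, 257, 100
noisy = (rng.standard_normal((channels, freq_bins, frames))
         + 1j * rng.standard_normal((channels, freq_bins, frames)))

# Placeholder mask: halve the magnitude and rotate the phase by 45 degrees.
# A real system would obtain this from the mask estimation network.
mask = 0.5 * np.exp(1j * np.pi / 4) * np.ones((freq_bins, frames))

enhanced = apply_complex_mask(noisy, mask)
print(enhanced.shape)
```

Because the mask is complex-valued, it can correct phase as well as magnitude, which a real-valued magnitude mask cannot do. The enhanced waveform would then be recovered with an inverse STFT.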