Recent speech enhancement methods based on convolutional neural networks (CNNs) and transformers have been shown to capture time-frequency (T-F) information in spectrograms effectively. However, the correlations among the channels of speech features remain unexplored. In theory, the channel maps of speech features, each produced by a different convolution kernel, contain information at different scales and are strongly correlated. To fill this gap, we propose a novel dual-branch architecture named channel-aware dual-branch conformer (CADB-Conformer), which effectively explores the long-range time and frequency correlations among different channels, respectively, to extract channel-relation-aware time-frequency information. Ablation studies conducted on the DNS-Challenge 2020 dataset demonstrate the importance of leveraging channel features and the significance of channel-relation-aware T-F information for speech enhancement. Extensive experiments also show that the proposed model outperforms recent methods at an attractive computational cost.
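To make the idea of channel-relation-aware T-F information more concrete, the following is a minimal sketch of channel mixing along two branches. It is not the CADB-Conformer itself: the abstract does not specify the layers, so the function names (`channel_attention`, `dual_branch`), the mean pooling, the parameter-free dot-product affinity, and the averaging fusion are all assumptions made for illustration. A feature tensor of shape (channels, time, frequency) is assumed.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def channel_attention(features, pool_axis):
    """Mix channels according to the similarity of their pooled profiles.

    features:  array of shape (C, T, F).
    pool_axis: 2 pools over frequency, so channels are compared by their
               time profiles (time branch); 1 pools over time, so channels
               are compared by their frequency profiles (frequency branch).
    Hypothetical helper: a parameter-free stand-in for learned attention.
    """
    profiles = features.mean(axis=pool_axis)       # (C, T) or (C, F)
    scale = np.sqrt(profiles.shape[1])
    # Row i holds the weights with which channel i attends to every channel.
    affinity = softmax(profiles @ profiles.T / scale, axis=-1)  # (C, C)
    # Each output channel is a weighted sum of all input channel maps.
    mixed = np.einsum("ij,jtf->itf", affinity, features)
    return mixed, affinity


def dual_branch(features):
    """Run the time and frequency branches and average their outputs.

    Averaging is an assumed fusion rule, chosen only for simplicity.
    """
    time_out, time_aff = channel_attention(features, pool_axis=2)
    freq_out, freq_aff = channel_attention(features, pool_axis=1)
    return 0.5 * (time_out + freq_out), time_aff, freq_aff


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((4, 10, 8))  # 4 channels, 10 frames, 8 bins
    fused, time_aff, freq_aff = dual_branch(feats)
    print(fused.shape, time_aff.shape, freq_aff.shape)
```

The two affinity matrices generally differ, because channels that evolve similarly over time need not have similar frequency profiles. That difference is the intuition behind handling the two correlations in separate branches; a trained model would replace the pooled dot product with learned projections.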