In this paper, we propose a two-stage heterogeneous lightweight network for monaural speech enhancement. Specifically, we design a novel two-stage framework consisting of a coarse-grained full-band mask estimation stage and a fine-grained low-frequency refinement stage. Instead of a hand-designed real-valued filter, we use a novel learnable complex-valued rectangular bandwidth (LCRB) filter bank to extract compact features. Furthermore, to match the distinct characteristics of the two stages, we adopt a heterogeneous structure: a U-shaped subnetwork as the backbone of CoarseNet and a single-scale subnetwork as the backbone of FineNet. We evaluate the proposed approach on the VoiceBank + DEMAND and DNS datasets. The experimental results show that the proposed method outperforms current state-of-the-art methods while maintaining a relatively small model size and low computational complexity.
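The coarse-then-fine data flow can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `two_stage_enhance`, the `low_bins` parameter, the use of a real-valued per-bin gain in the first stage, and the additive complex residual in the second stage are all assumptions made for illustration, and the two toy callables merely stand in for CoarseNet and FineNet. The LCRB filter bank is omitted entirely.

```python
from typing import Callable, List

Frame = List[complex]      # one spectral frame: a complex value per frequency bin
Spectrogram = List[Frame]  # time-major list of frames


def two_stage_enhance(
    noisy: Spectrogram,
    coarse_mask_fn: Callable[[Frame], List[float]],
    fine_residual_fn: Callable[[Frame], List[complex]],
    low_bins: int,
) -> Spectrogram:
    """Apply a full-band mask, then refine only the lowest `low_bins` bins.

    Both callables are hypothetical placeholders for learned subnetworks.
    """
    enhanced: Spectrogram = []
    for frame in noisy:
        # Stage 1 (coarse): one real-valued gain per bin across the full band.
        mask = coarse_mask_fn(frame)
        coarse = [g * x for g, x in zip(mask, frame)]

        # Stage 2 (fine): add a complex correction to the low-frequency bins only
        # (assumed refinement rule; the abstract does not specify it).
        low = coarse[:low_bins]
        residual = fine_residual_fn(low)
        refined_low = [c + r for c, r in zip(low, residual)]

        # Bins above `low_bins` keep their coarse estimate.
        enhanced.append(refined_low + coarse[low_bins:])
    return enhanced


# Toy stand-ins for the two subnetworks (fixed rules, not learned models).
def toy_coarse_mask(frame: Frame) -> List[float]:
    return [0.5 for _ in frame]


def toy_fine_residual(low_frame: Frame) -> List[complex]:
    return [0.1 * x for x in low_frame]


noisy = [[1 + 1j, 2 + 0j, 0 + 2j, 4 + 4j]]
out = two_stage_enhance(noisy, toy_coarse_mask, toy_fine_residual, low_bins=2)
```

In this sketch only the first two bins of each frame pass through the second stage, which mirrors the idea that the fine stage spends its capacity on the low-frequency region while the coarse stage covers the whole band.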