Prior work in speech enhancement has mostly modeled either time-domain or time-frequency-domain information alone, with little attention to the potential benefits of modeling both simultaneously. Because the two domains carry complementary information, combining them may improve model performance. In this letter, we propose an approach that models time-domain and time-frequency-domain information jointly in a single model. Starting from the causal version of DPT-FSNet as a baseline, we replace its original encoder with three separate encoders dedicated to time-domain, real-imaginary, and magnitude information, respectively. We also introduce a feature fusion module both before and after the dual-path processing blocks to better exploit information from the different domains. Experimental results show that the proposed approach outperforms existing state-of-the-art causal models while maintaining a relatively compact model size and low computational complexity.
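To make the three encoder inputs concrete, the following is a minimal sketch of how time-domain, real-imaginary, and magnitude views could be derived from one noisy waveform. It is an illustration under assumptions, not the configuration used in this letter: the function name `three_domain_inputs`, the frame length, the hop size, and the Hann window are all hypothetical choices, and the learned encoders, fusion modules, and dual-path blocks are not shown.

```python
import numpy as np


def three_domain_inputs(wave, frame_len=512, hop=256):
    """Return framed time-domain, real-imaginary, and magnitude views.

    frame_len and hop are illustrative assumptions, not values from the letter.
    """
    n_frames = 1 + (len(wave) - frame_len) // hop
    # Index matrix of shape (n_frames, frame_len) selecting overlapping frames.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx]  # time-domain view: (n_frames, frame_len)

    # Windowed one-sided FFT of each frame, i.e. a simple STFT.
    window = np.hanning(frame_len)
    spec = np.fft.rfft(frames * window, axis=-1)  # (n_frames, frame_len // 2 + 1)

    # Real-imaginary view stacks the two parts as channels.
    real_imag = np.stack([spec.real, spec.imag], axis=0)
    # Magnitude view discards phase.
    magnitude = np.abs(spec)
    return frames, real_imag, magnitude


# One second of synthetic noise at an assumed 16 kHz sampling rate.
wave = np.random.default_rng(0).standard_normal(16000)
frames, real_imag, magnitude = three_domain_inputs(wave)
```

All three views share the same frame axis, which is what would allow a fusion module to combine their encoded features frame by frame. The magnitude view is fully determined by the real-imaginary view, so any benefit of a separate magnitude encoder would come from the different representation it presents to the network rather than from new information.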