3D speech enhancement can effectively improve the auditory experience and plays a crucial role in augmented reality technology. However, traditional convolution-based speech enhancement methods are limited in their ability to extract dynamic voice information. In this paper, we incorporate a dual-path recurrent neural network block into the U-Net to iteratively extract dynamic audio information in both the time and frequency domains. We also propose an attention mechanism to fuse the original signal, reference signal, and generated masks. Moreover, we introduce a loss function that optimizes the network simultaneously in the time-frequency and time domains. Experimental results show that our system outperforms state-of-the-art systems on the ICASSP L3DAS23 challenge dataset.
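The abstract does not give the form of the joint loss, so the following is only a minimal sketch of one way to combine a time-domain term with a time-frequency term. Everything specific here is an assumption made for illustration: the L1 distances, the weight `alpha`, the Hann-windowed STFT settings, and the function names `stft`, `time_loss`, `tf_loss`, and `joint_loss` are hypothetical and not taken from the paper.

```python
import cmath
import math


def stft(signal, frame_len=8, hop=4):
    """Naive Hann-windowed STFT: a list of frames, each a list of complex bins."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        # Keep only the non-negative frequencies of the DFT of this frame.
        bins = [
            sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(bins)
    return frames


def time_loss(estimate, target):
    """Assumed time-domain term: mean absolute error between waveforms."""
    if len(estimate) != len(target) or not target:
        raise ValueError("signals must be non-empty and of equal length")
    return sum(abs(e - t) for e, t in zip(estimate, target)) / len(target)


def tf_loss(estimate, target, frame_len=8, hop=4):
    """Assumed time-frequency term: mean absolute error between STFT magnitudes."""
    if len(estimate) != len(target):
        raise ValueError("signals must have equal length")
    est_spec = stft(estimate, frame_len, hop)
    tgt_spec = stft(target, frame_len, hop)
    if not tgt_spec:
        raise ValueError("signal is shorter than one STFT frame")
    total, count = 0.0, 0
    for est_frame, tgt_frame in zip(est_spec, tgt_spec):
        for est_bin, tgt_bin in zip(est_frame, tgt_frame):
            total += abs(abs(est_bin) - abs(tgt_bin))
            count += 1
    return total / count


def joint_loss(estimate, target, alpha=0.5, frame_len=8, hop=4):
    """Weighted sum of both terms; alpha is a hypothetical balancing weight."""
    return (alpha * tf_loss(estimate, target, frame_len, hop)
            + (1.0 - alpha) * time_loss(estimate, target))


if __name__ == "__main__":
    clean = [math.sin(2 * math.pi * 0.1 * n) for n in range(32)]
    noisy = [c + 0.1 * ((-1) ** n) for n, c in enumerate(clean)]
    print("joint loss:", joint_loss(noisy, clean))
```

In a training setup, both terms would be computed with a differentiable STFT on batched tensors. The pure-Python version above only illustrates how the two domains contribute to a single scalar objective.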