We introduce a time-domain framework for multichannel speech enhancement that targets low latency and low computational cost. The framework places a multichannel neural Wiener filter (NWF) between two compact deep neural networks (DNNs). The first DNN produces an enhanced speech signal from which the NWF coefficients are estimated, and the second DNN refines the NWF output. The NWF is conceptually similar to the traditional frequency-domain Wiener filter, but it is trained for low-latency speech enhancement, and this training includes fine-tuning of both the analysis and synthesis transforms. Our results show that the NWF output, which contains minimal nonlinear distortion, attains performance comparable to that of the first DNN, which departs from the behavior of conventional Wiener filters. Despite its simplicity, training all components jointly outperforms sequential training. As a result, the framework achieves superior performance with fewer parameters and less computation, making it a compelling option for resource-efficient multichannel speech enhancement.
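The DNN, NWF, DNN ordering described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: a windowed FFT stands in for the learned analysis transform, the synthesis transform is omitted (the output stays in the transform domain), both DNNs are passed in as placeholder callables, and the filter uses batch statistics over all frames rather than a low-latency update. The names `frame`, `wiener_filter`, and `enhance` are hypothetical.

```python
import numpy as np

def frame(x, win, hop):
    """Slice (channels, samples) into overlapping frames: (channels, frames, win)."""
    n_frames = 1 + (x.shape[-1] - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[:, idx]

def wiener_filter(Y, S_hat, eps=1e-6):
    """Per-bin multichannel Wiener filter w = R_yy^{-1} E[y s*], applied as w^H y.

    Y:     (channels, frames, bins) complex mixture coefficients
    S_hat: (frames, bins) complex speech estimate from the first stage
    Returns the filtered coefficients with shape (frames, bins).
    """
    C, T, F = Y.shape
    Yf = Y.transpose(2, 0, 1)                               # (F, C, T)
    R_yy = Yf @ Yf.conj().transpose(0, 2, 1) / T            # (F, C, C) spatial covariance
    R_yy = R_yy + eps * np.eye(C)                           # diagonal loading for stability
    r_ys = (Yf * S_hat.T.conj()[:, None, :]).mean(axis=2)   # (F, C) cross-correlation
    w = np.linalg.solve(R_yy, r_ys[..., None])[..., 0]      # (F, C) filter per bin
    return np.einsum("fc,fct->ft", w.conj(), Yf).T          # (T, F)

def enhance(x, first_dnn, second_dnn, win=64, hop=32):
    """Placeholder pipeline: analysis -> first DNN -> Wiener filter -> second DNN."""
    # Assumption: a Hann-windowed FFT replaces the learned analysis transform.
    Y = np.fft.rfft(frame(x, win, hop) * np.hanning(win), axis=-1)
    S_hat = first_dnn(Y)          # stage 1: speech estimate used to fit the filter
    Z = wiener_filter(Y, S_hat)   # linear spatial filtering of the mixture
    return second_dnn(Z)          # stage 2: refinement of the filter output
```

A call such as `enhance(x, lambda Y: Y[0], lambda Z: Z)`, with `x` an array of shape `(channels, samples)`, uses the reference channel as a crude first-stage estimate and an identity second stage. Because the filter itself is linear in the mixture, any nonlinear processing in this sketch is confined to the two placeholder networks.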