Neural multi-channel speech enhancement models, particularly those based on the U-Net architecture, demonstrate promising performance and generalization potential. These models typically encode each input channel independently and integrate the channels only in later stages of the network. In this paper, we propose a novel modification: incorporating relative information from the outset by stacking each channel with a reference channel before encoding. This input strategy exploits inter-channel differences to adaptively fuse information across channels, capturing crucial spatial cues and improving overall performance. Experiments on the CHiME-3 dataset demonstrate consistent improvements in speech enhancement metrics across various architectures.
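The channel-stacking input strategy described above can be sketched minimally as follows. This is an illustrative reconstruction, not the paper's implementation: the tensor shapes (`C` channels, each a `T × F` spectrogram), the function name `stack_with_reference`, and the choice of channel 0 as the reference are all assumptions for demonstration.

```python
import numpy as np

def stack_with_reference(channels, ref_idx=0):
    """Pair every input channel with a reference channel.

    channels: array of shape (C, T, F) -- C microphone channels,
    each a T x F spectrogram (hypothetical shapes for illustration).
    Returns shape (C, 2, T, F): slot 0 holds the channel itself,
    slot 1 the reference channel, so the per-channel encoder sees
    relative (inter-channel) information from the very first layer.
    """
    # Broadcast the single reference spectrogram to all C channels.
    ref = np.broadcast_to(channels[ref_idx], channels.shape)
    # Stack each channel with the reference along a new pair axis.
    return np.stack([channels, ref], axis=1)

# Example: 4 microphone channels, 100 frames, 257 frequency bins.
x = np.random.randn(4, 100, 257)
y = stack_with_reference(x)
print(y.shape)  # (4, 2, 100, 257)
```

Each of the `C` pairs would then be fed to the (shared) encoder in place of a single channel, leaving the rest of the U-Net unchanged.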