Multi-channel speech enhancement extracts speech using multiple microphones that capture spatial cues, so making effective use of directional information is key. Deep learning shows great potential for multi-channel speech enhancement and often takes the short-time Fourier transform (STFT) directly as input. To fully leverage the spatial information, we introduce a method that uses spherical harmonics transform (SHT) coefficients as auxiliary model inputs. These coefficients concisely represent spatial distributions. Specifically, our model has two encoders, one for the STFT and another for the SHT. By fusing the outputs of both encoders in the decoder to estimate the enhanced STFT, we effectively incorporate spatial context. Evaluations on TIMIT under varying noise and reverberation show that our model outperforms established benchmarks. Remarkably, this is achieved with fewer computations and parameters. By leveraging spherical harmonics to incorporate directional cues, our model efficiently improves multi-channel speech enhancement performance.
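To make the SHT input more concrete, the following is a minimal sketch of how SHT coefficients might be derived from a multi-channel STFT. It is not the authors' implementation: the abstract does not state the harmonic order, the array geometry, or the estimation method. The sketch assumes real spherical harmonics up to first order, known microphone directions, and a least-squares fit via the pseudoinverse; the function names `real_sh_order1` and `sht_coefficients` are hypothetical.

```python
import numpy as np


def real_sh_order1(azimuth, colatitude):
    """Real spherical harmonics up to order 1 (assumed order).

    azimuth, colatitude: arrays of shape (M,) in radians.
    Returns a matrix of shape (M, 4) with columns Y00, Y1-1, Y10, Y11.
    """
    az = np.asarray(azimuth, dtype=float)
    col = np.asarray(colatitude, dtype=float)
    c0 = np.sqrt(1.0 / (4.0 * np.pi))
    c1 = np.sqrt(3.0 / (4.0 * np.pi))
    return np.stack(
        [
            c0 * np.ones_like(az),          # omnidirectional component
            c1 * np.sin(col) * np.sin(az),  # y-direction dipole
            c1 * np.cos(col),               # z-direction dipole
            c1 * np.sin(col) * np.cos(az),  # x-direction dipole
        ],
        axis=-1,
    )


def sht_coefficients(stft, azimuth, colatitude):
    """Least-squares SHT of a multi-channel STFT.

    stft: complex array of shape (M, F, T), one STFT per microphone.
    Returns complex coefficients of shape (4, F, T).
    """
    Y = real_sh_order1(azimuth, colatitude)  # (M, 4)
    Y_pinv = np.linalg.pinv(Y)               # (4, M)
    # Apply the same spatial transform to every time-frequency bin.
    return np.einsum("km,mft->kft", Y_pinv, stft)


# Usage with a hypothetical 4-microphone tetrahedral array.
dirs = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]) / np.sqrt(3)
az = np.arctan2(dirs[:, 1], dirs[:, 0])
col = np.arccos(dirs[:, 2])
stft = np.ones((4, 257, 10), dtype=complex)  # placeholder STFT, not real data
coeffs = sht_coefficients(stft, az, col)     # shape (4, 257, 10)
```

In a dual-encoder setup like the one described, a tensor such as `coeffs` would feed the SHT encoder while the raw STFT feeds the other. The pseudoinverse is used here only because it handles arrays with more microphones than coefficients; other estimators are equally plausible.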