Multi-channel speech enhancement exploits spatial information from multiple microphones to extract the target speech. However, most existing methods do not model spatial cues explicitly and instead rely on learning them implicitly from multi-channel spectra. To better leverage spatial information, we propose to model it explicitly by applying the spherical harmonic transform (SHT) to the multi-channel input. Specifically, we introduce a hierarchical framework in which lower-order harmonics, which capture broad spatial patterns, are estimated first and then combined with higher-order harmonics to recursively predict finer spatial details. Experiments on TIMIT demonstrate that the proposed method effectively recovers target spatial patterns and outperforms baseline models while using fewer parameters and less computation. These results indicate that explicit, hierarchical modeling of spatial information enables more effective multi-channel speech enhancement.
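
To make the SHT step concrete, the following is a minimal sketch of how multi-channel spectra could be projected onto spherical harmonics and grouped by order, which is the input a hierarchical, low-to-high-order model would consume. The abstract does not specify the array geometry, the maximum harmonic order, or the network, so everything here is an assumption for illustration: a hypothetical four-microphone tetrahedral array, real harmonics up to order 1, a least-squares encoder, and the helper names `real_sh_basis`, `sht`, and `split_by_order`. The sketch also ignores frequency-dependent radial terms and array scattering.

```python
import numpy as np

def real_sh_basis(directions):
    """Real spherical harmonics up to order 1 at microphone directions (M, 3).

    Returns an (M, 4) matrix with columns ordered
    (n, m) = (0, 0), (1, -1), (1, 0), (1, 1).
    """
    d = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    x, y, z = d[:, 0], d[:, 1], d[:, 2]
    c0 = 0.5 / np.sqrt(np.pi)            # order-0 normalization
    c1 = np.sqrt(3.0 / (4.0 * np.pi))    # order-1 normalization
    return np.stack([c0 * np.ones_like(x), c1 * y, c1 * z, c1 * x], axis=1)

def sht(spectra, basis):
    """Least-squares SHT: spectra (M, F, T) -> coefficients (K, F, T)."""
    encoder = np.linalg.pinv(basis)      # (K, M) pseudo-inverse of the basis
    return np.tensordot(encoder, spectra, axes=(1, 0))

def split_by_order(coeffs):
    """Group coefficients by harmonic order (order 0: 1 channel, order 1: 3)."""
    return {0: coeffs[:1], 1: coeffs[1:4]}

# Hypothetical tetrahedral array: four microphone directions (assumption).
mic_dirs = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], dtype=float)
basis = real_sh_basis(mic_dirs)

# Synthetic check: build microphone spectra from known coefficients,
# then verify that the transform recovers them.
rng = np.random.default_rng(0)
true_coeffs = rng.standard_normal((4, 5, 3)) + 1j * rng.standard_normal((4, 5, 3))
spectra = np.tensordot(basis, true_coeffs, axes=(1, 0))   # (M, F, T)

coeffs = sht(spectra, basis)
groups = split_by_order(coeffs)
```

In a hierarchical scheme of the kind described above, `groups[0]` would be enhanced first and its estimate passed along with `groups[1]` to the next stage; that stage is model-specific and is not sketched here.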