Training neural networks (NNs) for speech enhancement (SE) in distant speech-capturing scenarios requires paired distorted and clean reference speech signals. While such data are often generated through simulation, the mismatch between simulated and real recordings significantly limits SE accuracy. To address this issue, we propose Close-to-Distant microphone Projection (C2D projection), a method that generates paired data from real recordings captured by close and distant microphones. C2D projection estimates an optimal projection matrix that transforms close-microphone inputs into clean reference signals aligned with distant-microphone recordings, while simultaneously performing denoising. We show that this projection can be effectively realized using a variant of the Parametric Multichannel Wiener Filter (PMWF). Experimental results demonstrate that, on the challenging CHiME-6 dinner-party automatic speech recognition (ASR) task under oracle diarization, an NN trained with C2D-projected data outperforms the state-of-the-art Guided Source Separation (GSS) when the enhanced output from GSS is used as an auxiliary input to the NN.
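The idea of projecting close-microphone signals onto a distant-microphone recording can be illustrated with a minimal sketch. The code below is not the PMWF variant used in this work; it is a plain per-frequency least-squares (Wiener-style) projection, shown only to make the notion of a "projection matrix" concrete. The function name `project_close_to_distant`, the array layout, and the `diag_load` regularizer are all assumptions introduced for illustration.

```python
import numpy as np


def project_close_to_distant(close_stft, distant_stft, diag_load=1e-6):
    """Illustrative least-squares projection (an assumption, not the paper's PMWF variant).

    close_stft:   complex array, shape (F, C, T), STFT of C close-microphone channels
    distant_stft: complex array, shape (F, T), STFT of one distant-microphone channel
    Returns a complex array of shape (F, T): the close-microphone signals
    projected onto the distant-microphone recording, bin by bin.
    """
    F, C, T = close_stft.shape
    projected = np.empty((F, T), dtype=complex)
    for f in range(F):
        X = close_stft[f]            # (C, T) close-mic observations at bin f
        y = distant_stft[f]          # (T,)   distant-mic observation at bin f
        R_xx = X @ X.conj().T / T    # (C, C) spatial covariance of the close mics
        r_xy = X @ y.conj() / T      # (C,)   cross-covariance with the distant mic
        # Hypothetical diagonal loading to keep the solve well conditioned.
        R_xx = R_xx + diag_load * (np.trace(R_xx).real / C) * np.eye(C)
        # w minimizes the mean of |y - w^H x|^2 over frames.
        w = np.linalg.solve(R_xx, r_xy)
        projected[f] = w.conj() @ X  # (T,) projected reference at bin f
    return projected
```

Because noise at the distant microphone is largely uncorrelated with the close-microphone signals, it contributes little to the cross-covariance, so the projected output retains mainly the speech component as observed at the distant microphone. This is the intuition behind obtaining an aligned and denoised reference; the actual method may differ in how the filter is parameterized and estimated.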