The training of deep learning-based multichannel speech enhancement and source localization systems relies heavily on simulated room impulse responses and multichannel diffuse noise, due to the lack of large-scale real-recorded datasets. However, the acoustic mismatch between simulated and real-world data can degrade model performance when the models are applied in real-world scenarios. To bridge this simulation-to-real gap, this paper presents a new, relatively large-scale Real-recorded and annotated Microphone Array speech&Noise (RealMAN) dataset. The proposed dataset is valuable in two respects: 1) benchmarking speech enhancement and localization algorithms in real scenarios; 2) offering a substantial amount of real-world training data that can potentially improve the performance of real-world applications. Specifically, a 32-channel array with high-fidelity microphones is used for recording, and a loudspeaker is used to play the source speech signals. A total of 83 hours of speech signals (48 hours for a static speaker and 35 hours for a moving speaker) are recorded in 32 different scenes, and 144 hours of background noise are recorded in 31 different scenes. Both the speech and noise recording scenes cover various common indoor, outdoor, semi-outdoor and transportation environments, which enables the training of general-purpose speech enhancement and source localization networks. To obtain the task-specific annotations, the azimuth angle of the loudspeaker is annotated by automatically detecting the loudspeaker in the images of an omnidirectional fisheye camera. The direct-path signal, which is obtained by filtering the source speech signal with an estimated direct-path propagation filter, is set as the target clean speech for speech enhancement.
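To make the last step concrete, the following is a minimal sketch of how a direct-path target could be formed once a direct-path filter is available. It does not reproduce the paper's estimation procedure, which the abstract does not describe. The function names, the pure delay-plus-attenuation filter model, and the numeric values (a 47-sample delay, a gain of 0.3, a 128-tap filter) are all assumptions chosen for illustration.

```python
import numpy as np


def make_direct_path_filter(delay_samples: int, gain: float, length: int) -> np.ndarray:
    """Hypothetical direct-path filter: a single attenuated, delayed impulse.

    This is an illustrative stand-in for an estimated filter, not the
    filter model used for the dataset.
    """
    if not 0 <= delay_samples < length:
        raise ValueError("delay_samples must lie inside the filter length")
    h = np.zeros(length)
    h[delay_samples] = gain
    return h


def direct_path_target(source: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Filter the source speech with the direct-path filter.

    The output is truncated to the source length so that it stays
    sample-aligned with a recording of the same duration.
    """
    return np.convolve(source, h)[: len(source)]


# Toy stand-in for one second of source speech (random noise, not real speech).
rng = np.random.default_rng(0)
source = rng.standard_normal(16000)

# Assumed values for illustration only.
h = make_direct_path_filter(delay_samples=47, gain=0.3, length=128)
target = direct_path_target(source, h)
```

Under this simplified model the target is just a delayed and scaled copy of the source. An estimated filter would in general have more than one non-zero tap, but it would be applied with the same convolution.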