The training of deep learning-based multichannel speech enhancement and source localization systems relies heavily on simulated room impulse responses and multichannel diffuse noise, due to the lack of large-scale real-recorded datasets. However, the acoustic mismatch between simulated and real-world data can degrade model performance when the models are applied in real-world scenarios. To bridge this simulation-to-real gap, this paper presents a new, relatively large-scale Real-recorded and annotated Microphone Array speech&Noise (RealMAN) dataset. The proposed dataset is valuable in two respects: 1) benchmarking speech enhancement and localization algorithms in real scenarios; 2) offering a substantial amount of real-world training data that could improve performance in real-world applications. Specifically, a 32-channel array with high-fidelity microphones is used for recording, and a loudspeaker is used to play the source speech signals (about 35 hours of Mandarin speech). A total of 83.7 hours of speech signals (about 48.3 hours for a static speaker and 35.4 hours for a moving speaker) are recorded in 32 different scenes, and 144.5 hours of background noise are recorded in 31 different scenes. Both the speech and noise recording scenes cover various common indoor, outdoor, semi-outdoor and transportation environments, which enables the training of general-purpose speech enhancement and source localization networks. To obtain the task-specific annotations, the speaker location is annotated with an omnidirectional fisheye camera by automatically detecting the loudspeaker. The direct-path signal is used as the target clean speech for speech enhancement; it is obtained by filtering the source speech signal with an estimated direct-path propagation filter.
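To make the construction of the target clean speech concrete, the following is a minimal sketch of filtering a source signal with a direct-path filter. It does not reproduce the paper's estimation procedure: the filter here is a hypothetical stand-in, modeled as a single impulse with an integer-sample delay and 1/r attenuation. The function names, the nominal speed of sound, and the free-field model are all assumptions made for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed nominal value (not from the paper)


def direct_path_filter(distance_m, sample_rate, length=None):
    """Hypothetical direct-path filter: one delayed, attenuated impulse.

    Assumes free-field propagation with an integer-sample delay and
    1/r spherical spreading. The paper estimates this filter from
    recordings instead of computing it from geometry.
    """
    delay = int(round(distance_m / SPEED_OF_SOUND * sample_rate))
    if length is None:
        length = delay + 1
    if delay >= length:
        raise ValueError("filter length is too short for the delay")
    h = np.zeros(length)
    h[delay] = 1.0 / max(distance_m, 1e-3)  # guard against division by zero
    return h


def direct_path_target(source, h):
    """Filter the source signal with h and truncate to the source length."""
    return np.convolve(source, h)[: len(source)]


if __name__ == "__main__":
    fs = 16000
    source = np.random.default_rng(0).standard_normal(fs)  # 1 s placeholder signal
    h = direct_path_filter(distance_m=2.0, sample_rate=fs)
    target = direct_path_target(source, h)
    print(target.shape)
```

In this simplified model the target is only a delayed and scaled copy of the source. An estimated filter would typically also capture the loudspeaker and microphone responses, which this sketch ignores.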