We present Spatial LibriSpeech, a spatial audio dataset with over 650 hours of 19-channel audio, first-order ambisonics, and optional distractor noise. Spatial LibriSpeech is designed for machine learning model training, and it includes labels for source position, speaking direction, room acoustics and geometry. Spatial LibriSpeech is generated by augmenting LibriSpeech samples with 200k+ simulated acoustic conditions across 8k+ synthetic rooms. To demonstrate the utility of our dataset, we train models on four spatial audio tasks, resulting in a median absolute error of 6.60° on 3D source localization, 0.43 m on distance, 90.66 ms on T30, and 2.74 dB on DRR estimation. We show that the same models generalize well to widely used evaluation datasets, e.g., obtaining a median absolute error of 12.43° on 3D source localization on TUT Sound Events 2018, and 157.32 ms on T30 estimation on ACE Challenge.
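The reported figures are median absolute errors. As an illustration only, the following minimal sketch shows one common way to compute such metrics. It assumes that the 3D localization error is the angle between predicted and true direction vectors; the abstract does not specify the exact evaluation code, and the function names `median_angular_error_deg` and `median_absolute_error` are hypothetical.

```python
import numpy as np


def median_angular_error_deg(pred, target):
    """Median angle (degrees) between predicted and true 3D directions.

    Assumption: the localization error is the angle between direction
    vectors. Inputs have shape (N, 3) and need not be unit length.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # Normalize each row so the dot product equals the cosine of the angle.
    pred_unit = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    target_unit = target / np.linalg.norm(target, axis=-1, keepdims=True)
    cosine = np.sum(pred_unit * target_unit, axis=-1)
    # Clip to guard against rounding slightly outside [-1, 1].
    angles = np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0)))
    return float(np.median(angles))


def median_absolute_error(pred, target):
    """Median absolute error for scalar targets (e.g. distance, T30, DRR)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.median(np.abs(pred - target)))
```

The median is less sensitive than the mean to a few large outliers, such as front-back confusions in localization, which is one reason it is often preferred for reporting errors of this kind.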