As deeper and more complex models are developed for the task of sound event localization and detection (SELD), the demand for annotated spatial audio data continues to increase. Annotating field recordings with 360$^{\circ}$ video takes many hours from trained annotators, while recording events within motion-tracked laboratories is bounded by cost and expertise. Because of this, localization models rely on a relatively limited amount of spatial audio data in the form of spatial room impulse response (SRIR) datasets, which limits the progress of increasingly deep neural-network-based approaches. In this work, we demonstrate that simulated geometrical acoustics can provide an appealing solution to this problem. We use simulated geometrical acoustics to generate a novel SRIR dataset that can train a SELD model to performance similar to that obtained with a real SRIR dataset. Furthermore, we demonstrate that using simulated data to augment existing datasets improves on benchmarks set by state-of-the-art SELD models. We explore the potential and limitations of geometrical acoustic simulation for localization and event detection. We also propose further studies to verify the limitations of this method, as well as further methods to generate synthetic data for SELD tasks without the need to record more data.