In this paper, we propose a novel four-stage data augmentation approach to ResNet-Conformer based acoustic modeling for sound event localization and detection (SELD). First, we explore two spatial augmentation techniques, namely audio channel swapping (ACS) and multi-channel simulation (MCS), to deal with data sparsity in SELD. ACS and MCS augment the limited training data by expanding its direction of arrival (DOA) representations, so that acoustic models trained on the augmented data are robust to variations in the location of acoustic sources. Next, time-domain mixing (TDM) and time-frequency masking (TFM) are investigated to deal with overlapping sound events and to increase data diversity. Finally, ACS, MCS, TDM and TFM are combined step by step to form an effective four-stage data augmentation scheme. Tested on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 data set, our proposed augmentation approach greatly improves system performance, ranking our submitted system first in the SELD task of the DCASE 2020 Challenge. Furthermore, we employ a ResNet-Conformer architecture to model both global and local context dependencies of an audio sequence and win first place in the DCASE 2022 SELD evaluations.
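As a rough illustration of the two non-spatial stages named above, the following NumPy sketch shows one plausible form of time-domain mixing and time-frequency masking. It is not the authors' implementation. The function names, array layouts, mask widths, and the simplification of labels to binary per-frame class activity (ignoring DOA targets) are all assumptions made for the example.

```python
import numpy as np

def time_domain_mix(clip_a, clip_b, labels_a, labels_b):
    """Hypothetical TDM: add two multi-channel waveforms sample by sample.

    clip_*:   (channels, samples) arrays of equal shape.
    labels_*: (frames, classes) binary activity matrices (assumed format).
    The mixed clip is labeled with the union of both activity matrices.
    """
    mixed = clip_a + clip_b
    labels = np.maximum(labels_a, labels_b)
    return mixed, labels

def time_frequency_mask(spec, max_t, max_f, rng):
    """Hypothetical TFM: zero one time stripe and one frequency stripe.

    spec: (channels, frames, bins) feature array; requires
    max_t <= frames and max_f <= bins. The same stripes are applied
    to every channel, an assumption intended to leave inter-channel
    cues in the unmasked region untouched.
    """
    out = spec.copy()
    _, n_frames, n_bins = out.shape
    t_width = rng.integers(0, max_t + 1)   # stripe widths may be zero
    f_width = rng.integers(0, max_f + 1)
    t0 = rng.integers(0, n_frames - t_width + 1)
    f0 = rng.integers(0, n_bins - f_width + 1)
    out[:, t0:t0 + t_width, :] = 0.0
    out[:, :, f0:f0 + f_width] = 0.0
    return out
```

In a pipeline of this kind, mixing would be applied to waveforms before feature extraction and masking to the extracted features afterwards. The spatial stages (ACS and MCS) are omitted here because they depend on the array format and DOA label conventions, which the abstract does not specify.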