Sound event detection (SED) methods that leverage a large pre-trained Transformer encoder network have shown promising performance in recent DCASE challenges. However, they still rely on an RNN-based context network to model temporal dependencies, largely due to the scarcity of labeled data. In this work, we propose a pure Transformer-based SED model with masked-reconstruction-based pre-training, termed MAT-SED. Specifically, a Transformer with relative positional encoding is first designed as the context network and pre-trained in a self-supervised way via a masked-reconstruction task on all available target data. The encoder and the context network are then jointly fine-tuned in a semi-supervised manner. Furthermore, a global-local feature fusion strategy is proposed to enhance the localization capability. Evaluated on DCASE2023 task 4, MAT-SED surpasses the state of the art, achieving PSDS1/PSDS2 scores of 0.587/0.896 respectively.
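To make the masked-reconstruction pre-training concrete, the sketch below shows one possible training step in PyTorch: frame-level features produced by the pre-trained encoder are randomly masked, the Transformer context network reconstructs them, and the loss is computed only on the masked frames. This is a minimal illustration under stated assumptions, not the paper's implementation; the names `ContextTransformer` and `masked_reconstruction_step`, the zero-masking, the 0.75 mask ratio, and the MSE loss are all hypothetical choices, and the relative positional encoding used in MAT-SED is omitted for brevity.

```python
import torch
import torch.nn as nn


class ContextTransformer(nn.Module):
    """Hypothetical Transformer context network over frame-level embeddings.

    MAT-SED uses relative positional encoding; this sketch falls back to a
    plain nn.TransformerEncoder to stay short and self-contained.
    """

    def __init__(self, dim=768, n_layers=3, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):  # x: (batch, frames, dim)
        return self.encoder(x)


def masked_reconstruction_step(context_net, frame_features, mask_ratio=0.75):
    """One self-supervised step: mask random frames, reconstruct them.

    frame_features: (batch, frames, dim) output of the pre-trained encoder.
    The mask ratio, zero-masking, and MSE loss are illustrative assumptions.
    """
    b, t, _ = frame_features.shape
    mask = torch.rand(b, t, device=frame_features.device) < mask_ratio  # True = masked frame
    corrupted = frame_features.masked_fill(mask.unsqueeze(-1), 0.0)     # hide masked frames

    reconstructed = context_net(corrupted)

    # Reconstruction loss only on the masked positions.
    loss = ((reconstructed - frame_features) ** 2)[mask].mean()
    return loss


if __name__ == "__main__":
    net = ContextTransformer()
    feats = torch.randn(2, 156, 768)  # e.g. 156 frames of 768-dim encoder features
    print(masked_reconstruction_step(net, feats).item())
```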