In this paper, we propose AST-SED, an effective sound event detection (SED) method based on the audio spectrogram transformer (AST) model pretrained on the large-scale AudioSet for the audio tagging (AT) task. Pretrained AST models have recently shown promise on DCASE2022 challenge task4, where they help mitigate the lack of sufficient real annotated data. However, mainly due to differences between the AT and SED tasks, directly using the outputs of a pretrained AST model is suboptimal. Hence, the proposed AST-SED adopts an encoder-decoder architecture to enable effective and efficient fine-tuning without needing to redesign or retrain the AST model. Specifically, the Frequency-wise Transformer Encoder (FTE) consists of transformers with self-attention along the frequency axis to address the issue of multiple overlapping audio events in a single clip. The Local Gated Recurrent Units Decoder (LGD) consists of nearest-neighbor interpolation (NNI) and Bidirectional Gated Recurrent Units (Bi-GRU) to compensate for the loss of temporal resolution in the pretrained AST model output. Experimental results on the DCASE2022 task4 development set demonstrate the superiority of the proposed AST-SED with the FTE-LGD architecture. In particular, an Event-Based F1-score (EB-F1) of 59.60% and a Polyphonic Sound Detection Score scenario1 (PSDS1) of 0.5140 significantly outperform CRNN and other pretrained AST-based systems.
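To make the nearest-neighbor interpolation (NNI) step of the LGD more concrete, the following minimal Python sketch upsamples a sequence of frame-level feature vectors along the time axis by an integer factor. The function name `nearest_neighbor_upsample`, the factor of 4, and the toy feature values are assumptions chosen for illustration and are not taken from the paper. The Bi-GRU that follows NNI in the described decoder is omitted here.

```python
from typing import List, Sequence


def nearest_neighbor_upsample(
    frames: Sequence[Sequence[float]], factor: int
) -> List[List[float]]:
    """Repeat each frame `factor` times along the time axis.

    For an integer factor, nearest-neighbor interpolation reduces to
    copying every input frame `factor` times in a row.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    upsampled: List[List[float]] = []
    for frame in frames:
        for _ in range(factor):
            upsampled.append(list(frame))  # copy so rows are independent
    return upsampled


# Toy input (hypothetical values): 3 coarse time steps, 2 features each.
coarse = [[0.1, 0.9], [0.4, 0.6], [0.8, 0.2]]
fine = nearest_neighbor_upsample(coarse, factor=4)
print(len(coarse), "->", len(fine))
```

Because the upsampled frames are plain copies, they carry no new temporal detail on their own. This is presumably why the decoder pairs NNI with a recurrent layer that can refine each repeated frame using its local context.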