Sound event detection (SED) often suffers from data scarcity. The recent baseline system of DCASE2023 challenge task 4 leverages large pretrained self-supervised learning (SelfSL) models to mitigate this limitation, as the pretrained models help produce more discriminative features for SED. However, the challenge baseline system and most challenge submissions treat the pretrained models as frozen feature extractors, and fine-tuning of the pretrained models has rarely been studied. In this work, we study how to fine-tune pretrained models for SED. We first introduce ATST-Frame, our newly proposed SelfSL model, into the SED system. ATST-Frame was specifically designed for learning frame-level representations of audio signals and has obtained state-of-the-art (SOTA) performance on a series of downstream tasks. We then propose a fine-tuning method for ATST-Frame that uses both (in-domain) unlabelled and labelled SED data. Our experiments show that the proposed method overcomes the overfitting problem when fine-tuning the large pretrained network, and that our SED system obtains new SOTA results of 0.587/0.812 PSDS1/PSDS2 on the DCASE challenge task 4 dataset.