Fine-tune the pretrained ATST model for sound event detection

Sound event detection (SED) often suffers from the data deficiency problem. The recent baseline system in the DCASE2023 challenge task 4 leverages the large pretrained self-supervised learning (SelfSL) models to mitigate such restriction, where the pretrained models help to produce more discriminative features for SED. However, the pretrained models are regarded as a frozen feature extractor in the challenge baseline system and most of the challenge submissions, and fine-tuning of the pretrained models has been rarely studied. In this work, we study the fine-tuning method of the pretrained models for SED. We first introduce ATST-Frame, our newly proposed SelfSL model, to the SED system. ATST-Frame was especially designed for learning frame-level representations of audio signals and obtained state-of-the-art (SOTA) performances on a series of downstream tasks. We then propose a fine-tuning method for ATST-Frame using both (in-domain) unlabelled and labelled SED data. Our experiments show that, the proposed method overcomes the overfitting problem when fine-tuning the large pretrained network, and our SED system obtains new SOTA results of 0.587/0.812 PSDS1/PSDS2 scores on the DCASE challenge task 4 dataset.

翻译：声音事件检测（SED）常面临数据不足的问题。DCASE2023挑战任务4的最新基线系统利用大规模预训练自监督学习（SelfSL）模型来缓解这一限制，其中预训练模型有助于为SED生成更具判别力的特征。然而，在挑战基线系统及多数参赛提交方案中，预训练模型仅被用作冻结的特征提取器，针对预训练模型的微调方法鲜有研究。本研究探讨了用于SED的预训练模型微调方法。我们首先将新提出的SelfSL模型ATST-Frame引入SED系统。ATST-Frame专为学习音频信号的帧级表示而设计，并在多项下游任务中取得了最先进（SOTA）性能。随后我们提出了一种利用（域内）无标签和标签化SED数据对ATST-Frame进行微调的方法。实验表明，所提方法克服了大规模预训练网络微调时的过拟合问题，我们的SED系统在DCASE挑战任务4数据集上取得了0.587/0.812 PSDS1/PSDS2分数的新SOTA结果。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

【WSDM2020】超越统计关系：将知识关系整合到多标签音乐风格分类的风格关联中（附pdf）

专知会员服务

18+阅读 · 2019年11月23日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日