This technical report presents a low-complexity deep learning system for acoustic scene classification (ASC). The proposed system comprises two phases: (Phase I) training a teacher network, and (Phase II) training a student network using knowledge distilled from the teacher. In the first phase, the teacher, a large-footprint model, is trained. Embeddings are then extracted from the trained teacher as the feature map of its penultimate layer. In the second phase, the student, a low-complexity model, is trained with these teacher embeddings. In our experiments on the DCASE 2023 Task 1 Development dataset, the system meets the low-complexity requirement and achieves a best classification accuracy of 57.4%, improving on the DCASE baseline by 14.5%.
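The two-phase procedure can be illustrated with a minimal NumPy sketch. It is not the report's implementation: the toy data, the layer sizes, the fixed random "teacher" (standing in for a trained Phase I model), the single linear-layer student, and the mean-squared-error objective between student outputs and teacher embeddings are all assumptions made for illustration. The names `teacher_embedding` and `distill` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for audio features: 256 clips, 40 features each (assumed sizes).
X = rng.normal(size=(256, 40))

# Phase I stand-in: a wide "teacher" with fixed random weights.
# In a real system these weights would come from supervised training.
W_t = rng.normal(size=(40, 128)) / np.sqrt(40)
P_t = rng.normal(size=(128, 16)) / np.sqrt(128)


def teacher_embedding(x):
    """Return the toy teacher's penultimate-layer feature map (16-dim)."""
    hidden = np.maximum(x @ W_t, 0.0)  # ReLU hidden layer
    return hidden @ P_t


def distill(x, targets, steps=200, lr=1.0):
    """Phase II sketch: fit a small linear student to the teacher embeddings.

    Minimises the mean squared error between student outputs and the
    teacher embeddings with plain gradient descent (an assumed objective).
    """
    w_s = np.zeros((x.shape[1], targets.shape[1]))
    losses = []
    for _ in range(steps):
        residual = x @ w_s - targets
        losses.append(float(np.mean(residual ** 2)))
        grad = 2.0 * x.T @ residual / residual.size  # d(MSE)/d(w_s)
        w_s -= lr * grad
    return w_s, losses


E = teacher_embedding(X)   # embeddings extracted after Phase I
W_s, losses = distill(X, E)

teacher_params = W_t.size + P_t.size
student_params = W_s.size
print(f"teacher params: {teacher_params}, student params: {student_params}")
print(f"distillation loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In this toy setting the student has far fewer parameters than the teacher and is trained only against the teacher's embeddings; a practical student would typically also carry a classification head trained on scene labels.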