This work presents an improved version of the system that we submitted to Task 1 of the DCASE2023 challenge. We propose a method for low-complexity acoustic scene classification based on a parallel attention-convolution network, which consists of four modules: pre-processing, fusion, global contextual information extraction, and local contextual information extraction. The proposed network captures both global and local contextual information from each audio clip in a computationally efficient manner. In addition, we integrate other techniques into our method, namely knowledge distillation, data augmentation, and adaptive residual normalization. When evaluated on the official dataset of the DCASE2023 challenge, our method obtains the highest accuracy of 56.10% with 5.21 K parameters and 1.44 million multiply-accumulate operations. It outperforms the top two systems of the DCASE2023 challenge in both accuracy and complexity, and achieves a state-of-the-art result. Code is available at: https://github.com/Jessytan/Low-complexity-ASC.
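The abstract names knowledge distillation as one of the integrated techniques but does not state the formulation used. As a minimal sketch, assuming the common soft-target formulation (a weighted sum of hard-label cross-entropy and a temperature-scaled KL divergence between teacher and student outputs), the loss could look like the following. The function names, the `temperature` and `alpha` values, and the three-class toy logits are all illustrative assumptions, not taken from the paper or its repository.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Soft-target distillation loss (assumed formulation, not the paper's).

    alpha weights the hard-label cross-entropy; (1 - alpha) weights the
    KL term between softened teacher and student distributions.
    """
    # Hard-label cross-entropy at temperature 1.
    student_probs = softmax(student_logits)
    ce = -math.log(student_probs[label] + 1e-12)

    # Soft-target term; multiplying by T^2 keeps its gradient scale
    # comparable to the cross-entropy term as the temperature changes.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kd = kl_divergence(p_teacher, p_student) * temperature ** 2

    return alpha * ce + (1.0 - alpha) * kd


# Toy usage with three hypothetical scene classes.
loss = distillation_loss([0.2, 1.5, -0.3], [0.1, 2.0, -0.5], label=1)
```

In this sketch a small student network is trained against both the ground-truth label and the softened predictions of a larger teacher, which is the usual motivation for distillation in low-complexity settings.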