Recently, massive architectures based on Convolutional Neural Networks (CNNs) and self-attention mechanisms have become necessary for audio classification. While these techniques are state-of-the-art, their effectiveness can only be guaranteed with huge computational costs and parameter counts, large amounts of data augmentation, transfer from large datasets, and other tricks. By exploiting the lightweight nature of audio, we propose an efficient network structure called the Paired Inverse Pyramid Structure (PIP) and a network built on it called the Paired Inverse Pyramid Structure MLP Network (PIPMN). The PIPMN reaches 96\% accuracy on Environmental Sound Classification (ESC) with the UrbanSound8K dataset and 93.2\% accuracy on Music Genre Classification (MGC) with the GTZAN dataset, using only 1 million parameters. Both results are achieved without data augmentation or model transfer. Public code is available at: https://github.com/JNAIC/PIPMN