Audio Spectrogram Transformer models lead the field of audio tagging, outperforming the previously dominant Convolutional Neural Networks (CNNs). Their advantage stems from their ability to scale up and exploit large-scale datasets such as AudioSet. However, Transformers are more demanding than CNNs in terms of model size and computational requirements. We propose a training procedure for efficient CNNs based on offline Knowledge Distillation (KD) from high-performing yet complex Transformers. The proposed training scheme and the efficient CNN design based on MobileNetV3 result in models that outperform previous solutions in parameter efficiency, computational efficiency, and prediction performance. We provide models of different complexity levels, ranging from low-complexity models up to a new state-of-the-art performance of .483 mAP on AudioSet. Source code is available at: https://github.com/fschmid56/EfficientAT
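To make the idea of offline knowledge distillation concrete, the following is a minimal sketch and not the authors' implementation. It assumes a multi-label tagging setup in which the student CNN is trained on a weighted sum of two binary cross-entropy terms: one against the ground-truth labels and one against teacher predictions that were computed once and stored ahead of training (the "offline" part). The function names, the weight `lam`, the cache layout, and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy; targets may be hard (0/1) or soft."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(probs)


def distillation_loss(student_logits, labels, teacher_logits, lam=0.5):
    """Weighted sum of label loss and distillation loss.

    lam is a hypothetical weighting: lam=1.0 ignores the teacher,
    lam=0.0 trains on the teacher predictions only.
    """
    student_probs = [sigmoid(z) for z in student_logits]
    teacher_probs = [sigmoid(z) for z in teacher_logits]
    label_loss = bce(student_probs, labels)
    kd_loss = bce(student_probs, teacher_probs)
    return lam * label_loss + (1.0 - lam) * kd_loss


# Offline KD: teacher logits are precomputed and looked up during training,
# so the expensive Transformer never runs inside the student's training loop.
teacher_cache = {"clip_0001": [2.0, -1.5, 0.3]}  # hypothetical stored logits

student_logits = [1.2, -0.8, 0.1]  # hypothetical student output for one clip
labels = [1.0, 0.0, 0.0]           # hypothetical multi-hot ground truth

loss = distillation_loss(student_logits, labels, teacher_cache["clip_0001"])
```

In a real pipeline the list arithmetic would be replaced by tensor operations in a deep learning framework, and the cache would hold teacher predictions for every training clip; the structure of the loss is the part this sketch is meant to show.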