Audio classification is vital in areas such as speech and music recognition. Extracting features from the audio signal, such as Mel-spectrograms and MFCCs, is a critical step in audio classification. These features are typically represented as spectrograms for classification. Researchers have explored various techniques for classifying spectrograms, including traditional machine learning and deep learning methods, but these can be computationally expensive. To simplify this process, a more direct approach inspired by sequence classification in NLP can be used. This paper proposes a Transformer-encoder-based model for audio classification using MFCCs. The model was benchmarked on the ESC-50, Speech Commands v0.02, and UrbanSound8k datasets and showed strong performance, with its highest accuracy, 95.2%, obtained on the UrbanSound8k dataset. With only 127,544 total parameters, the model is lightweight yet highly effective at the audio classification task.
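To give a feel for why a Transformer encoder over MFCC frames can stay in the hundred-thousand-parameter range, the following is a minimal sketch that counts the trainable parameters of a small encoder classifier. All hyperparameters (`n_mfcc`, `d_model`, `d_ff`, `n_layers`, `n_classes`) are hypothetical values chosen for illustration and are not the configuration reported in the paper; the layer layout (single fused attention projection, two-layer feed-forward block, two LayerNorms, parameter-free sinusoidal positions) is likewise an assumption about a standard encoder, so the total is not expected to match 127,544.

```python
def encoder_layer_params(d_model: int, d_ff: int) -> int:
    """Trainable parameters in one standard Transformer encoder layer (assumed layout)."""
    # Self-attention: Q, K, V projections plus the output projection, each with a bias.
    attention = 4 * d_model * d_model + 4 * d_model
    # Position-wise feed-forward block: d_model -> d_ff -> d_model, with biases.
    feed_forward = 2 * d_model * d_ff + d_ff + d_model
    # Two LayerNorms, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)
    return attention + feed_forward + layer_norms


def classifier_params(n_mfcc: int, d_model: int, d_ff: int,
                      n_layers: int, n_classes: int) -> int:
    """Total parameters for: linear MFCC projection -> encoder stack -> linear head."""
    # Each MFCC frame (n_mfcc coefficients) is projected to a d_model-sized token.
    input_projection = n_mfcc * d_model + d_model
    encoder = n_layers * encoder_layer_params(d_model, d_ff)
    # Classification head applied to the pooled encoder output.
    head = d_model * n_classes + n_classes
    # Sinusoidal positional encodings are assumed, so they add no parameters.
    return input_projection + encoder + head


if __name__ == "__main__":
    # Hypothetical configuration, for illustration only.
    total = classifier_params(n_mfcc=40, d_model=64, d_ff=128,
                              n_layers=2, n_classes=10)
    print(total)  # 70218
```

Under these assumed settings the count is dominated by the encoder layers, and it grows quadratically with `d_model`, which is why keeping the embedding width small has the largest effect on model size.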