Audio classification is vital in areas such as speech and music recognition. Extracting features from the audio signal, such as Mel-spectrograms and MFCCs, is a critical step in audio classification. These features are then represented as spectrograms for classification. Researchers have explored various techniques for classifying spectrograms, including traditional machine learning and deep learning methods, but these can be computationally expensive. A simpler alternative is to treat the task like sequence classification in NLP. This paper proposes a Transformer-encoder-based model for audio classification using MFCCs. The model was benchmarked on the ESC-50, Speech Commands v0.02, and UrbanSound8k datasets and showed strong performance, with the highest accuracy, 95.2%, obtained when the model was trained on the UrbanSound8k dataset. The model has only 127,544 parameters in total, making it lightweight yet highly efficient at the audio classification task.
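The idea of treating MFCC frames like tokens in a sequence classifier can be illustrated with a minimal PyTorch sketch. This is not the paper's implementation: the class name `MFCCTransformerClassifier`, the helper `count_parameters`, and every hyperparameter (40 MFCC coefficients, model width 64, 4 heads, 2 layers, learned positional embeddings, mean pooling) are assumptions chosen only to show how such a model can stay small.

```python
import torch
import torch.nn as nn


class MFCCTransformerClassifier(nn.Module):
    """Hypothetical sketch: classify a clip from its sequence of MFCC frames."""

    def __init__(self, n_mfcc=40, d_model=64, n_heads=4, n_layers=2,
                 d_ff=128, n_classes=10, max_frames=512):
        super().__init__()
        # Project each MFCC frame (one time step) to the model width,
        # analogous to a token embedding in NLP.
        self.input_proj = nn.Linear(n_mfcc, d_model)
        # Learned positional embedding, one vector per frame index (assumed).
        self.pos_emb = nn.Parameter(torch.zeros(1, max_frames, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=d_ff,
            dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Linear classification head on the pooled sequence representation.
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, mfcc):
        # mfcc has shape (batch, frames, n_mfcc), with frames <= max_frames.
        n_frames = mfcc.size(1)
        x = self.input_proj(mfcc) + self.pos_emb[:, :n_frames]
        x = self.encoder(x)
        # Mean-pool over time (assumed), then map to class logits.
        return self.head(x.mean(dim=1))


def count_parameters(model):
    """Number of trainable parameters in the model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


if __name__ == "__main__":
    model = MFCCTransformerClassifier()
    dummy = torch.randn(8, 100, 40)  # 8 clips, 100 frames, 40 MFCCs each
    print(model(dummy).shape)        # logits of shape (batch, n_classes)
    print(count_parameters(model))
```

With these assumed settings the sketch has on the order of 10^5 trainable parameters, the same order of magnitude as the 127,544 reported, although the configuration actually used in the paper is not stated here. The default of 10 output classes matches UrbanSound8k; the other datasets would need a different `n_classes`.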