Evaluating raw waveforms with deep learning frameworks for speech emotion recognition

Speech emotion recognition is a challenging task in the field of speech processing. For this reason, the feature extraction process plays a crucial role in representing and processing speech signals. In this work, we present a model that feeds raw audio files directly into deep neural networks, without any feature extraction stage, to recognize emotions using six different data sets: EMO-DB, RAVDESS, TESS, CREMA, SAVEE, and TESS+RAVDESS. To demonstrate the contribution of the proposed model, traditional feature extraction techniques, namely the mel-scale spectrogram and mel-frequency cepstral coefficients, are combined with machine learning algorithms, ensemble learning methods, and deep and hybrid deep learning techniques, and their performance is evaluated. Support vector machine, decision tree, naive Bayes, and random forest models are evaluated as machine learning algorithms, while majority voting and stacking are assessed as ensemble learning techniques. Moreover, convolutional neural networks (CNN), long short-term memory (LSTM) networks, and a hybrid CNN-LSTM model are evaluated as deep learning techniques and compared with the machine learning and ensemble learning methods. To demonstrate the effectiveness of the proposed model, it is also compared with state-of-the-art studies. According to the experimental results, the CNN model outperforms existing approaches with 95.86% accuracy on the TESS+RAVDESS data set using raw audio files, thereby setting a new state of the art. In speaker-independent audio classification, the proposed model achieves 90.34% accuracy on EMO-DB with the CNN model, 90.42% on RAVDESS with the CNN model, 99.48% on TESS with the LSTM model, 69.72% on CREMA with the CNN model, and 85.76% on SAVEE with the CNN model.
