Speech Emotion Recognition (SER) plays a pivotal role in human-computer interaction: by inferring a speaker's emotional state, it supports more empathetic and effective communication across a wide range of applications. This study proposes an approach that combines self-supervised feature extraction with supervised classification to recognize emotion from short audio segments. In the preprocessing step, to eliminate the need for hand-crafted audio features, we employ a self-supervised feature extractor based on the Wav2Vec model to capture acoustic features from the audio data. The feature maps output by the preprocessing step are then fed to a custom-designed Convolutional Neural Network (CNN)-based model that performs emotion classification. On the ShEMO dataset, the proposed method surpasses two baseline methods, namely a support vector machine classifier and transfer learning with a pretrained CNN. A comparison with state-of-the-art methods for the SER task also indicates the superiority of the proposed method. Our findings underscore the pivotal role of deep unsupervised feature learning in advancing SER and in improving emotional understanding in human-computer interaction.
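The two-stage pipeline described above (segment the audio, extract a feature map, classify it with a convolutional model) can be sketched in plain Python. This is only an illustrative sketch under stated assumptions, not the authors' architecture: `extract_features` is a hypothetical placeholder that stands in for the pretrained Wav2Vec encoder so that the example runs without a model download, the single convolution layer and its sizes are arbitrary, the weights are random and untrained, and the `EMOTIONS` label list is illustrative rather than the ShEMO label set.

```python
import math
import random

# Illustrative label set (assumption; not taken from the ShEMO dataset).
EMOTIONS = ["anger", "happiness", "sadness", "neutral"]

def split_into_segments(samples, segment_len):
    """Split a waveform into non-overlapping segments, dropping the remainder."""
    return [samples[i:i + segment_len]
            for i in range(0, len(samples) - segment_len + 1, segment_len)]

def extract_features(segment, frame_len=4):
    """Hypothetical stand-in for the Wav2Vec encoder.

    Returns a feature map of shape (frames, 3). A real pipeline would call a
    pretrained self-supervised model here instead of computing statistics.
    """
    frames = [segment[i:i + frame_len]
              for i in range(0, len(segment) - frame_len + 1, frame_len)]
    return [[sum(f) / len(f),                 # mean amplitude
             sum(x * x for x in f) / len(f),  # mean energy
             max(abs(x) for x in f)]          # peak amplitude
            for f in frames]

def conv1d_relu(feature_map, kernels, bias):
    """Valid 1D convolution along the time axis, followed by ReLU."""
    width = len(kernels[0])
    out = []
    for start in range(len(feature_map) - width + 1):
        window = feature_map[start:start + width]
        row = []
        for kernel, b in zip(kernels, bias):
            act = b + sum(kernel[t][d] * window[t][d]
                          for t in range(width)
                          for d in range(len(window[t])))
            row.append(max(0.0, act))
        out.append(row)
    return out

def global_max_pool(activations):
    """Collapse the time axis by taking the maximum of each channel."""
    return [max(channel) for channel in zip(*activations)]

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def init_params(dim, n_kernels, width, n_classes, seed=0):
    """Random, untrained weights; training is out of scope for this sketch."""
    rng = random.Random(seed)
    rand = lambda: rng.uniform(-0.5, 0.5)
    return {
        "kernels": [[[rand() for _ in range(dim)] for _ in range(width)]
                    for _ in range(n_kernels)],
        "conv_bias": [0.0] * n_kernels,
        "dense": [[rand() for _ in range(n_kernels)] for _ in range(n_classes)],
        "dense_bias": [0.0] * n_classes,
    }

def classify(feature_map, params):
    """Convolution -> global max pooling -> dense layer -> class probabilities."""
    hidden = global_max_pool(
        conv1d_relu(feature_map, params["kernels"], params["conv_bias"]))
    logits = [b + sum(w * h for w, h in zip(weights, hidden))
              for weights, b in zip(params["dense"], params["dense_bias"])]
    return softmax(logits)

# Usage: a synthetic sine wave stands in for a recorded utterance.
waveform = [math.sin(0.05 * n) for n in range(1000)]
params = init_params(dim=3, n_kernels=5, width=3, n_classes=len(EMOTIONS))
for segment in split_into_segments(waveform, segment_len=400):
    probs = classify(extract_features(segment), params)
    print(dict(zip(EMOTIONS, (round(p, 3) for p in probs))))
```

Because the weights are untrained, the printed probabilities carry no meaning; the sketch only shows how a per-segment feature map flows through a convolutional classifier. In practice the placeholder extractor would be replaced by a pretrained Wav2Vec model and the classifier would be trained on labeled segments.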