Speech Emotion Recognition (SER) is an affective technology that enables intelligent embedded devices to interact with sensitivity. In a similar way, call centre employees recognise customers' emotions from the pitch, energy, and tone of their voice and adapt their own speech to achieve a high-quality interaction. This work explores, for the first time, the effects of the harmonic and percussive components of Mel spectrograms in SER. We attempt to leverage the Mel spectrogram by decomposing it into distinguishable acoustic features for use in our proposed architecture, which comprises a novel feature map generator algorithm, a CNN-based feature extractor, and a multi-layer perceptron (MLP) classifier. The study focuses specifically on effective data augmentation techniques for building an enriched hybrid feature map. The resulting function outputs a 2D image that serves as input to a pre-trained CNN-VGG16 feature extractor. We also investigate other acoustic features, such as MFCCs, the chromagram, spectral contrast, and the tonnetz, to assess the proposed framework. The framework achieves a test accuracy of 92.79% on the Berlin EMO-DB database, which is higher than that reported in previous works using CNN-VGG16.
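To make the idea of decomposing a spectrogram into harmonic and percussive components more concrete, the following is a minimal sketch and not the paper's feature map generator. It assumes a median-filtering separation with soft masks, a common way to perform harmonic-percussive separation, applied to a synthetic array that stands in for a Mel spectrogram. The function names (`median_filter_1d`, `hpss`, `hybrid_feature_map`), the kernel size, and the choice to stack the three components as image channels are all illustrative assumptions. In practice, a library such as librosa (`librosa.feature.melspectrogram`, `librosa.decompose.hpss`) would typically compute these from real audio.

```python
import numpy as np


def median_filter_1d(x, size, axis):
    """Sliding median along one axis with edge padding (size should be odd)."""
    pad = size // 2
    pad_width = [(0, 0)] * x.ndim
    pad_width[axis] = (pad, pad)
    padded = np.pad(x, pad_width, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, size, axis=axis)
    return np.median(windows, axis=-1)


def hpss(spec, kernel=17, eps=1e-10):
    """Split a (freq, time) magnitude spectrogram into harmonic and percussive parts.

    Assumption: median filtering plus soft masking; the paper does not
    specify its decomposition in the abstract.
    """
    harm_est = median_filter_1d(spec, kernel, axis=1)  # smooth along time
    perc_est = median_filter_1d(spec, kernel, axis=0)  # smooth along frequency
    mask_h = harm_est**2 / (harm_est**2 + perc_est**2 + eps)
    return spec * mask_h, spec * (1.0 - mask_h)


def to_uint8(x):
    """Min-max scale an array to the 0..255 range of an 8-bit image channel."""
    lo, hi = x.min(), x.max()
    return (255.0 * (x - lo) / (hi - lo + 1e-10)).astype(np.uint8)


def hybrid_feature_map(mel_spec):
    """Stack full, harmonic, and percussive log-spectrograms as 3 image channels.

    Hypothetical layout, chosen only because CNNs such as VGG16 take
    3-channel images; resizing to the network's input size is omitted.
    """
    harmonic, percussive = hpss(mel_spec)
    channels = [to_uint8(np.log1p(c)) for c in (mel_spec, harmonic, percussive)]
    return np.stack(channels, axis=-1)


# Synthetic stand-in for a Mel spectrogram: 64 bands x 100 frames.
spec = np.full((64, 100), 0.01)
spec[20, :] += 1.0  # steady tone: horizontal ridge (harmonic-like)
spec[:, 50] += 1.0  # click: vertical ridge (percussive-like)

harmonic, percussive = hpss(spec)
image = hybrid_feature_map(spec)
```

In this toy input, the horizontal ridge ends up mostly in `harmonic` and the vertical ridge mostly in `percussive`, while the two parts still sum back to the original spectrogram because the masks are complementary.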