Emotion recognition (ER) from speech signals is a robust approach because speech cannot be imitated the way facial expressions or text-based sentiment can. The information underlying emotions is valuable for human-computer interaction, enabling intelligent machines to interact with sensitivity in the real world. Previous ER studies based on speech signal processing have focused exclusively on associations between different signal mode decomposition methods and hidden informative features. However, improper selection of decomposition parameters causes loss of informative signal components through mode duplication and mode mixing. In contrast, the current study proposes VGG-optiVMD, an empowered variational mode decomposition (VMD) algorithm, which distinguishes meaningful speech features and automatically selects the number of decomposed modes and the optimum balancing parameter for the data fidelity constraint by assessing their effects on the VGG16 flattening output layer. Various feature vectors were employed to train the VGG16 network on different databases and to assess the reproducibility and reliability of VGG-optiVMD. One-, two-, and three-dimensional feature vectors were constructed by concatenating Mel-frequency cepstral coefficients, chromagrams, Mel spectrograms, Tonnetz diagrams, and spectral centroids. Results confirmed a synergistic relationship between fine-tuning of the signal sample rate and decomposition parameters and classification accuracy, with a state-of-the-art accuracy of 96.09% in predicting seven emotions on the Berlin EMO-DB database.
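The automatic selection of the number of modes and the balancing parameter can be pictured as a search over candidate pairs, each scored by its downstream effect. The abstract does not state the search procedure, so the following is only a minimal sketch that assumes an exhaustive grid search. The names `select_vmd_params`, `decompose`, and `score` are hypothetical: `decompose` stands in for any VMD routine, and `score` stands in for whatever quality measure is derived from the VGG16 output.

```python
from itertools import product
from typing import Any, Callable, Iterable, Optional, Tuple


def select_vmd_params(
    signal: Any,
    k_values: Iterable[int],
    alpha_values: Iterable[float],
    decompose: Callable[[Any, int, float], Any],
    score: Callable[[Any], float],
) -> Optional[Tuple[int, float, float]]:
    """Return (K, alpha, score) for the best-scoring candidate pair.

    Assumption: an exhaustive grid search; the paper may use another strategy.
    `decompose(signal, K, alpha)` is a placeholder for a VMD implementation.
    `score(modes)` is a placeholder for a measure based on the classifier.
    """
    best: Optional[Tuple[int, float, float]] = None
    for k, alpha in product(list(k_values), list(alpha_values)):
        modes = decompose(signal, k, alpha)  # candidate decomposition
        s = score(modes)                     # higher is better
        if best is None or s > best[2]:
            best = (k, alpha, s)
    return best


# Toy stand-ins, used only to show the call pattern (not a real VMD or CNN).
def toy_decompose(signal, k, alpha):
    return {"k": k, "alpha": alpha, "signal": signal}


def toy_score(modes):
    # Peaks at K = 4 and alpha = 2000 by construction.
    return -abs(modes["k"] - 4) - abs(modes["alpha"] - 2000.0) / 1000.0


best_params = select_vmd_params(
    [0.0, 1.0, 0.0, -1.0], range(2, 8), [1000.0, 2000.0, 3000.0],
    toy_decompose, toy_score,
)
```

In practice the two callables would wrap a real VMD implementation and a trained network, which makes each evaluation expensive. A coarser grid or an early-stopping rule may therefore be preferable to the full sweep shown here.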