Traditional approaches to speech emotion recognition, such as LSTM, CNN, RNN, SVM, and MLP models, have difficulty capturing long-term dependencies in sequential data, modeling temporal dynamics, and representing the complex patterns and relationships found in multimodal data. This research addresses these shortcomings by proposing an ensemble model that combines Graph Convolutional Networks (GCN) for processing textual data with the HuBERT transformer for analyzing audio signals. We found that GCNs excel at capturing long-term contextual dependencies and relationships within textual data: by operating on graph-based representations of text, they detect contextual meaning and semantic relationships between words. HuBERT, in turn, uses self-attention to capture long-range dependencies, which lets it model the temporal dynamics of speech and pick up the subtle nuances and variations that contribute to emotion recognition. Combining GCN and HuBERT lets the ensemble leverage the strengths of both approaches. The two modalities are analyzed simultaneously, and fusing them extracts complementary information that enhances the discriminative power of the emotion recognition system. The results indicate that the combined model can overcome the limitations of traditional methods, leading to enhanced accuracy in recognizing emotions from speech.
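The two-branch design described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the graph construction, the layer sizes, or the fusion rule, so everything below is an assumption made for illustration: a single graph convolution with the common symmetric normalization, a hypothetical chain-shaped word graph, a random vector standing in for a pooled HuBERT utterance embedding, untrained random weights, and late fusion by averaging the class probabilities of the two branches. All function and variable names are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2, the normalization of a standard GCN layer."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    d_inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    return [[a_hat[i][j] * d_inv_sqrt[i] * d_inv_sqrt[j] for j in range(n)] for i in range(n)]

def gcn_layer(adj_norm, features, weight):
    """One graph convolution: ReLU(A_norm X W)."""
    out = matmul(matmul(adj_norm, features), weight)
    return [[max(v, 0.0) for v in row] for row in out]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(p_text, p_audio, audio_weight=0.5):
    """Late fusion (assumed): convex combination of per-branch class probabilities."""
    return [(1.0 - audio_weight) * t + audio_weight * a for t, a in zip(p_text, p_audio)]

def random_matrix(rng, rows, cols, scale=1.0):
    """Untrained Gaussian weights, used only to make the sketch runnable."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
num_words, feat_dim, hidden_dim, audio_dim, num_emotions = 4, 8, 6, 16, 4

# Hypothetical word graph for one utterance: a chain 0-1-2-3.
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
word_features = random_matrix(rng, num_words, feat_dim)

# Text branch: graph convolution, mean-pool the nodes, linear classifier.
adj_norm = normalize_adjacency(adj)
hidden = gcn_layer(adj_norm, word_features, random_matrix(rng, feat_dim, hidden_dim))
text_embedding = [sum(col) / num_words for col in zip(*hidden)]
p_text = softmax(matmul([text_embedding], random_matrix(rng, hidden_dim, num_emotions))[0])

# Audio branch: random stand-in for a pooled HuBERT utterance embedding.
audio_embedding = [rng.gauss(0.0, 1.0) for _ in range(audio_dim)]
p_audio = softmax(matmul([audio_embedding], random_matrix(rng, audio_dim, num_emotions, 0.1))[0])

# Fuse the two branches and pick the most probable emotion class.
p_fused = fuse(p_text, p_audio)
predicted_emotion = max(range(num_emotions), key=lambda k: p_fused[k])
```

In a real system the audio embedding would come from a pretrained HuBERT encoder and the weights would be learned. Averaging probabilities is only one possible fusion rule; concatenating the two embeddings before a shared classifier is a common alternative.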