Exploring a proper way to conduct multi-speech-feature fusion for cross-corpus speech emotion recognition is crucial, as different speech features can provide complementary cues reflecting human emotional states. While most previous approaches extract only a single speech feature for emotion recognition, existing fusion methods such as concatenation, parallel connection, and splicing ignore the heterogeneous patterns in feature-to-feature interactions, which limits the performance of existing systems. In this paper, we propose a novel graph-based fusion method to explicitly model the relationship between every pair of speech features. Specifically, we propose a multi-dimensional edge-feature learning strategy, a Graph-based multi-Feature fusion method for speech emotion recognition. It represents each speech feature as a node and learns multi-dimensional edge features that explicitly describe the relationship between each feature-feature pair in the context of emotion recognition. In this way, the learned multi-dimensional edge features encode speech feature-level information from both the vertex and edge dimensions. Our approach consists of three modules: an Audio Feature Generation (AFG) module, an Audio-Feature Multi-dimensional Edge Feature (AMEF) module, and a Speech Emotion Recognition (SER) module. The proposed method yields satisfactory results on the SEWA dataset and outperforms the baseline of the AVEC 2019 Workshop and Challenge. We used data from two cultures, German and Hungarian, in the SEWA dataset as our training and validation sets; for German, the CCC scores improve by 17.28% for arousal and 7.93% for liking. Our method also achieves a 13% improvement over alternative fusion techniques, including one-dimensional edge-based feature fusion approaches.
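The core idea of representing each speech feature as a graph node and learning a multi-dimensional vector per feature pair can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature names, dimensions, and the single linear-plus-ReLU edge mapping are all assumptions, and in the actual AMEF module the edge weights would be learned end-to-end.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: three speech features (e.g., MFCC, eGeMAPS, and
# spectrogram embeddings) projected to a common dimension d, each one a node.
d, edge_dim = 16, 8
nodes = [rng.standard_normal(d) for _ in range(3)]

# Illustrative edge-learning weights (randomly initialized here; trained
# jointly with the recognition head in a real system).
W = rng.standard_normal((2 * d, edge_dim))

def edge_feature(x_i, x_j):
    """Multi-dimensional edge feature for one node pair:
    a linear map over the concatenated pair, followed by ReLU."""
    return np.maximum(np.concatenate([x_i, x_j]) @ W, 0.0)

# One edge vector per ordered feature-feature pair.
edges = {(i, j): edge_feature(nodes[i], nodes[j])
         for i in range(len(nodes))
         for j in range(len(nodes)) if i != j}

# Fuse vertex- and edge-level information (here by mean pooling) into one
# representation that a downstream SER head could consume.
fused = np.concatenate([np.mean(nodes, axis=0),
                        np.mean(list(edges.values()), axis=0)])
print(fused.shape)  # (d + edge_dim,) = (24,)
```

The fused vector carries information from both dimensions the abstract mentions: the pooled node states and the pooled pairwise edge features.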