Exploring a proper way to conduct multi-speech-feature fusion for cross-corpus speech emotion recognition is crucial, as different speech features can provide complementary cues reflecting human emotional states. Most previous approaches extract only a single speech feature for emotion recognition, and existing fusion methods such as concatenation, parallel connection, and splicing ignore the heterogeneous patterns in feature-feature interactions, limiting the performance of existing systems. In this paper, we propose a novel graph-based fusion method to explicitly model the relationships between every pair of speech features. Specifically, we propose a multi-dimensional edge feature learning strategy, a graph-based multi-feature fusion method for speech emotion recognition. It represents each speech feature as a node and learns multi-dimensional edge features to explicitly describe the relationship between each feature-feature pair in the context of emotion recognition. In this way, the learned multi-dimensional edge features encode speech feature-level information from both the vertex and edge dimensions. Our approach consists of three modules: an Audio Feature Generation (AFG) module, an Audio-Feature Multi-dimensional Edge Feature (AMEF) module, and a Speech Emotion Recognition (SER) module. The proposed method yields satisfactory results on the SEWA dataset and outperforms the baseline of the AVEC 2019 Workshop and Challenge. We used data from two cultures, German and Hungarian, in the SEWA dataset as our training and validation sets; on the German data, the concordance correlation coefficient (CCC) scores improved by 17.28% for arousal and 7.93% for liking. Our method also achieves a 13% improvement over alternative fusion techniques, including one-dimensional edge-based feature fusion approaches.
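To make the core idea concrete, the following is a minimal sketch of graph-based fusion with multi-dimensional edge features: each speech feature stream becomes a node, an MLP produces an edge vector (rather than a scalar weight) for every node pair, and node representations are updated by messages conditioned on those edge vectors before pooling for emotion regression. The class name MultiDimEdgeFusion, the layer sizes, and the three-stream toy input are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class MultiDimEdgeFusion(nn.Module):
    """Sketch: each speech feature type is a graph node; a learned
    multi-dimensional edge vector describes every node pair."""

    def __init__(self, feat_dims, hidden_dim=128, edge_dim=16, num_targets=3):
        super().__init__()
        # Project each heterogeneous speech feature (e.g. spectral,
        # prosodic, deep embeddings) into a shared node space.
        self.node_proj = nn.ModuleList(
            [nn.Linear(d, hidden_dim) for d in feat_dims]
        )
        # Edge MLP: maps a concatenated node pair to a multi-dimensional
        # edge feature (edge_dim > 1, unlike scalar attention weights).
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, edge_dim),
        )
        # Message MLP: combines a neighbour node with its edge feature.
        self.msg_mlp = nn.Sequential(
            nn.Linear(hidden_dim + edge_dim, hidden_dim), nn.ReLU(),
        )
        # Regression head for continuous emotion dimensions
        # (e.g. arousal, valence, liking).
        self.head = nn.Linear(hidden_dim, num_targets)

    def forward(self, feats):
        # feats: list of tensors, one per feature type, each (batch, dim_i).
        nodes = [proj(f) for proj, f in zip(self.node_proj, feats)]
        x = torch.stack(nodes, dim=1)                       # (B, N, H)
        B, N, H = x.shape
        # Build all ordered node pairs; learn one edge vector per pair.
        src = x.unsqueeze(2).expand(B, N, N, H)
        dst = x.unsqueeze(1).expand(B, N, N, H)
        edges = self.edge_mlp(torch.cat([src, dst], dim=-1))  # (B, N, N, E)
        # Message passing: aggregate neighbours conditioned on the
        # multi-dimensional edge features, then mean-pool the nodes.
        msgs = self.msg_mlp(torch.cat([dst, edges], dim=-1)).mean(dim=2)
        fused = (x + msgs).mean(dim=1)                      # (B, H)
        return self.head(fused)

# Toy usage with three hypothetical feature streams.
model = MultiDimEdgeFusion(feat_dims=[88, 40, 512])
preds = model([torch.randn(4, 88), torch.randn(4, 40), torch.randn(4, 512)])
print(preds.shape)  # torch.Size([4, 3])
```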