Speech emotion recognition (SER) has advanced significantly thanks to deep-learning methods, and textual information further enhances its performance. However, few studies have focused on the physiological information involved in speech production, which also encodes speaker traits, including emotional states. To bridge this gap, we conducted a series of experiments to investigate the potential of phonation excitation information and articulatory kinematics for SER. Because training data for this purpose are scarce, we introduce a portrayed emotional dataset, STEM-E2VA, which includes audio together with physiological data such as electroglottography (EGG) and electromagnetic articulography (EMA); EGG and EMA capture phonation excitation and articulatory kinematics, respectively. Additionally, we performed emotion recognition using physiological data estimated from speech through inversion methods, instead of the collected EGG and EMA signals, to explore the feasibility of applying such physiological information in real-world SER. Experimental results confirm the effectiveness of incorporating physiological information about speech production into SER and demonstrate its potential for practical use in real-world scenarios.