Speech emotion recognition (SER) is vital for obtaining emotional intelligence and understanding the contextual meaning of speech. Variations of consonant-vowel (CV) phonemic boundaries can enrich acoustic context with linguistic cues, which in turn affects SER. In practice, speech emotions are treated as single labels over an acoustic segment for a given time duration. However, phone boundaries within speech are not discrete events; therefore, the perceived emotion state should also be distributed over potentially continuous time windows. This research explores the implications of acoustic context and phone boundaries for local markers in SER using an attention-based approach. The benefits of a distributed approach to speech emotion understanding are supported by the results of cross-corpora analysis experiments. In these experiments, phones and words are mapped to the attention vectors along with the fundamental frequency to observe the overlapping distributions and, thereby, the relationship between acoustic context and emotion. This work aims to bridge psycholinguistic theory research with computational modelling for SER.