In human-computer interaction (HCI), Speech Emotion Recognition (SER) is a key technology for understanding human intentions and emotions. Traditional SER methods struggle to effectively capture the long-term temporal correlations and dynamic variations in complex emotional expressions. To overcome these limitations, we introduce the PCQ method, a pioneering approach for SER via \textbf{P}rogressive \textbf{C}hannel \textbf{Q}uerying. Through channel querying, the method drills down layer by layer along the channel dimension, dynamically modeling the long-term contextual information of emotions. This multi-level analysis gives the PCQ method an edge in capturing the nuances of human emotions. Experimental results show that our model improves the weighted average (WA) accuracy by 3.98\% and 3.45\% and the unweighted average (UA) accuracy by 5.67\% and 5.83\% on the IEMOCAP and EMODB emotion recognition datasets, respectively, significantly exceeding the baseline levels.
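To make the progressive channel-querying idea concrete, the sketch below shows one plausible reading: a set of learnable channel queries attends over the temporal axis of a speech feature map, and stacking such steps with a shrinking number of queries "drills down" layer by layer. The array sizes, the halving schedule, and the random stand-ins for learned queries are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def channel_query(features, queries):
    """One channel-querying step (assumed form): channel queries (C, d)
    attend over the time/channel axis of the feature map (T, d) -> (C, d)."""
    scores = queries @ features.T / np.sqrt(features.shape[-1])  # (C, T)
    attn = softmax(scores, axis=-1)   # each query distributes weight over frames
    return attn @ features            # weighted sum of frame features, (C, d)

rng = np.random.default_rng(0)
T, d = 120, 64                        # frames and feature dim (hypothetical)
x = rng.standard_normal((T, d))       # stand-in acoustic feature sequence

# Progressive querying: each layer halves the number of channel queries,
# moving from coarse to fine summaries of long-term emotional context.
out = x
for C in (32, 16, 8):
    q = rng.standard_normal((C, d))   # stand-in for learned query embeddings
    out = channel_query(out, q)

print(out.shape)
```

Each step compresses the sequence into fewer query-conditioned channels while every output still aggregates information from the full input span, which is one way long-range context can be preserved across layers.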