Speech emotion recognition is a challenging research topic that plays a critical role in human-computer interaction. Multimodal inputs further improve performance because more emotional information is available. However, existing methods learn from all the information in a sample, even though only a small portion of it relates to emotion. The redundant information acts as noise and limits system performance. In this paper, a key-sparse Transformer is proposed for efficient emotion recognition by focusing more on emotion-related information. The proposed method is evaluated on the IEMOCAP and LSSED datasets. Experimental results show that the proposed method achieves better performance than state-of-the-art approaches.