This paper presents an approach to transferring multi-modal emotion recognition models to a more practical and resource-efficient uni-modal counterpart, focusing on speech-only emotion recognition. Recognizing emotions from speech signals is a critical task with applications in human-computer interaction, affective computing, and mental health assessment. However, existing state-of-the-art models often rely on multi-modal inputs, such as facial expressions and gestures, which may be unavailable or impractical to capture in real-world scenarios. To address this, we propose a framework that combines knowledge distillation and masked training.
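To make the two named techniques concrete, the following is a minimal sketch, not the framework's actual implementation. It assumes a standard temperature-scaled distillation objective, in which a multi-modal teacher supervises a speech-only student, and it assumes that masked training means zeroing out non-speech inputs. The function names, the `temperature` and `alpha` parameters, and the modality keys (`"speech"`, `"face"`) are all hypothetical and chosen for illustration only.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Assumed objective: weighted sum of a soft-target KL term
    (teacher -> student) and a hard-label cross-entropy term."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # The T^2 factor keeps the soft-term gradient scale comparable across temperatures.
    soft_term = (temperature ** 2) * kl_divergence(soft_teacher, soft_student)
    hard_term = -math.log(softmax(student_logits)[label] + 1e-12)
    return alpha * soft_term + (1.0 - alpha) * hard_term


def mask_modalities(features, keep=("speech",)):
    """Assumed masking scheme: zero every modality not listed in `keep`."""
    return {name: (list(vec) if name in keep else [0.0] * len(vec))
            for name, vec in features.items()}


# Hypothetical usage: the teacher saw all modalities, the student sees speech only.
inputs = {"speech": [0.2, -1.3, 0.7], "face": [1.1, 0.4]}
student_inputs = mask_modalities(inputs)
loss = distillation_loss(student_logits=[1.0, 0.5, -0.2],
                         teacher_logits=[2.0, 0.1, -1.0],
                         label=0)
```

Under these assumptions, the soft term vanishes when the student reproduces the teacher's logits, so the speech-only model is rewarded for matching the multi-modal teacher's output distribution even though it never sees the masked modalities.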