Speech emotion recognition (SER) has garnered increasing attention due to its wide range of applications, including human-machine interaction, virtual assistants, and mental health support. However, existing SER methods often overlook the information gap between the pre-training speech recognition task and the downstream SER task, resulting in sub-optimal performance. Moreover, current methods require substantial time for fine-tuning on each specific speech dataset, such as IEMOCAP, which limits their effectiveness in real-world scenarios with large-scale noisy data. To address these issues, we propose an active learning (AL)-based fine-tuning framework for SER, called \textsc{After}, that leverages task adaptation pre-training (TAPT) and AL methods to enhance performance and efficiency. Specifically, we first use TAPT to minimize the information gap between the pre-training speech recognition task and the downstream speech emotion recognition task. Then, AL methods are employed to iteratively select a subset of the most informative and diverse samples for fine-tuning, thereby reducing time consumption. Experiments demonstrate that our proposed method \textsc{After}, using only 20\% of samples, improves accuracy by 8.45\% and reduces time consumption by 79\%. Further extensions of \textsc{After} and ablation studies confirm its effectiveness and applicability to various real-world scenarios. Our source code is available on GitHub for reproducibility: https://github.com/Clearloveyuan/AFTER.
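The iterative selection step described above can be sketched as a standard uncertainty-based active learning loop. This is a minimal illustration only, not the \textsc{After} implementation: it scores unlabeled pool samples by predictive entropy and picks the top-$k$ each round, omitting the diversity criterion and the actual model fine-tuning; all function names here are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy of a predictive distribution (uncertainty score)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_batch(pool_probs, k):
    """Return indices of the k most uncertain samples (highest entropy)."""
    scored = sorted(enumerate(pool_probs),
                    key=lambda ip: entropy(ip[1]), reverse=True)
    return [i for i, _ in scored[:k]]

def active_learning_loop(pool_probs, rounds, k):
    """Iteratively move the k most uncertain samples from the unlabeled
    pool into the labeled set.  A real SER pipeline would fine-tune the
    model on the labeled set after each round and recompute pool_probs."""
    labeled = []
    remaining = set(range(len(pool_probs)))
    for _ in range(rounds):
        candidates = sorted(remaining)
        idx = select_batch([pool_probs[i] for i in candidates], k)
        chosen = [candidates[j] for j in idx]
        labeled.extend(chosen)
        remaining -= set(chosen)
        # placeholder: fine-tune model on `labeled`, refresh `pool_probs`
    return labeled
```

In a full pipeline the model is re-trained between rounds, so uncertainty scores change; here the probabilities are fixed purely for illustration.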