Speech emotion recognition (SER) has drawn increasing attention for its applications in human-machine interaction. However, existing SER methods ignore the information gap between the pre-training speech recognition task and the downstream SER task, which leads to sub-optimal performance. They also require substantial time to fine-tune on each specific speech dataset, which limits their effectiveness in real-world settings with large-scale noisy data. To address these issues, we propose an active learning (AL) based fine-tuning framework for SER that leverages task adaptation pre-training (TAPT) and AL methods to improve both performance and efficiency. Specifically, we first use TAPT to minimize the information gap between the pre-training task and the downstream task. AL methods then iteratively select a subset of the most informative and diverse samples for fine-tuning, reducing time consumption. Experiments demonstrate that using only 20\% of the samples improves accuracy by 8.45\% and reduces time consumption by 79\%.
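The selection step described in the abstract (choosing samples that are both informative and diverse) can be illustrated with a minimal sketch. The abstract does not say which AL criteria are used, so everything below is an assumption made for illustration: informativeness is approximated by the entropy of the model's predicted emotion distribution, diversity by greedy farthest-point selection over feature vectors, and the names `entropy`, `select_batch`, and `pool_factor` are hypothetical rather than taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_batch(probs, feats, budget, pool_factor=3):
    """Return up to `budget` indices of unlabeled samples to label next.

    probs: per-sample predicted class probabilities (list of lists).
    feats: per-sample feature vectors of equal length (list of sequences).
    Assumed strategy (not from the paper): keep the most uncertain
    samples as candidates, then spread the picks out in feature space.
    """
    # Informativeness: rank samples by predictive entropy, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: entropy(probs[i]), reverse=True)
    candidates = ranked[: budget * pool_factor]
    if not candidates:
        return []

    # Diversity: start from the most uncertain candidate, then repeatedly
    # add the candidate farthest from everything already chosen.
    chosen = [candidates[0]]
    while len(chosen) < min(budget, len(candidates)):
        best = max(
            (i for i in candidates if i not in chosen),
            key=lambda i: min(math.dist(feats[i], feats[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen
```

In an iterative setup of the kind the abstract describes, a function like this would be called once per round: fine-tune on the currently labeled subset, score the remaining pool, add the selected indices to the labeled set, and repeat until the labeling budget (for example 20\% of the data) is used up.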