Machine learning models for speech emotion recognition (SER) can be trained for different tasks and are usually evaluated on the few datasets available per task. Tasks include arousal, valence, dominance, emotional categories, or tone of voice. These models are evaluated mainly in terms of correlation or recall, and their predictions always contain some error. That error manifests itself in model behaviour, which can differ widely along different dimensions even when models achieve the same recall or correlation. This paper introduces a testing framework for investigating the behaviour of speech emotion recognition models by requiring different metrics to reach a certain threshold in order to pass a test. The test metrics can be grouped in terms of correctness, fairness, and robustness. The framework also provides a method for automatically specifying test thresholds for the fairness tests, based on the datasets used, along with recommendations on how to select the remaining thresholds. Nine transformer-based models, an xLSTM-based model, and a convolutional baseline model are tested for arousal, valence, dominance, and emotional categories. The test results highlight that models with high correlation or recall may rely on shortcuts, such as text sentiment, and differ in terms of fairness.
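The pass/fail logic described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the metric and test names (`correctness_ccc`, `fairness_sex_gap`, `robustness_noise_ccc`) and the threshold values are hypothetical placeholders.

```python
# Sketch of threshold-based testing: a model passes a test only if the
# corresponding metric clears its threshold (hypothetical names and values).

def passes_test(metric_value: float, threshold: float) -> bool:
    """Return True if the metric reaches the required threshold."""
    return metric_value >= threshold

def run_test_suite(metrics: dict, thresholds: dict) -> dict:
    """Map each test name to a pass/fail result."""
    return {name: passes_test(metrics[name], thresholds[name])
            for name in thresholds}

# Hypothetical results for one model, grouped by correctness,
# fairness, and robustness as in the framework described above.
metrics = {
    "correctness_ccc": 0.62,      # concordance correlation on a test set
    "fairness_sex_gap": 0.91,     # e.g. ratio of per-group performance
    "robustness_noise_ccc": 0.48, # correlation under added noise
}
thresholds = {
    "correctness_ccc": 0.50,
    "fairness_sex_gap": 0.85,
    "robustness_noise_ccc": 0.50,
}

results = run_test_suite(metrics, thresholds)
```

Here the model would pass the correctness and fairness tests but fail the robustness test, illustrating how two models with equal overall correlation can behave differently across test groups.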