Machine learning models for speech emotion recognition (SER) can be trained for different tasks and are usually evaluated on a few available datasets per task. Tasks include arousal, valence, dominance, emotional categories, or tone of voice. These models are mainly evaluated in terms of correlation or recall, and their predictions always contain some errors. The errors manifest themselves in model behaviour, which can differ greatly along different dimensions even when the same recall or correlation is achieved by the model. This paper introduces a testing framework to investigate the behaviour of speech emotion recognition models by requiring different metrics to reach a certain threshold in order to pass a test. The test metrics can be grouped in terms of correctness, fairness, and robustness. The paper further provides a method to specify test thresholds for fairness tests automatically, based on the datasets used, and recommendations on how to select the remaining test thresholds. Seven different transformer-based models and a baseline model are tested for arousal, valence, dominance, and emotional categories. The test results highlight that models with high correlation or recall might rely on shortcuts, such as text sentiment, to achieve it, and differ in terms of fairness.
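The pass/fail idea described above can be sketched as a minimal threshold-based test harness. This is an illustrative sketch only, not the paper's actual framework or API: the metric names, groupings, and threshold values below are hypothetical, and the direction flag distinguishes metrics that must reach a threshold (e.g. a correlation) from those that must stay below one (e.g. a fairness gap).

```python
# Minimal sketch of a threshold-based testing harness in the spirit of
# the framework described above. All metric names, group labels, and
# threshold values are hypothetical examples, not the paper's actual ones.

def run_tests(metrics, thresholds):
    """Each test passes iff its measured metric meets its threshold.

    metrics:    {"group/test_name": measured_value}
    thresholds: {"group/test_name": (threshold, direction)}
                direction "ge": value must be >= threshold (e.g. recall, CCC)
                direction "le": value must be <= threshold (e.g. group gap)
    Returns {"group/test_name": (status, measured_value)}.
    """
    results = {}
    for name, (thr, direction) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            results[name] = ("skipped", None)  # metric not computed
            continue
        passed = value >= thr if direction == "ge" else value <= thr
        results[name] = ("passed" if passed else "failed", value)
    return results


# Hypothetical results for a single arousal model, one test per group.
metrics = {
    "correctness/concordance_cc": 0.71,  # correlation with gold labels
    "fairness/sex_score_gap": 0.04,      # prediction gap between groups
    "robustness/noise_cc_drop": 0.12,    # correlation drop under noise
}
thresholds = {
    "correctness/concordance_cc": (0.50, "ge"),
    "fairness/sex_score_gap": (0.075, "le"),
    "robustness/noise_cc_drop": (0.10, "le"),
}

results = run_tests(metrics, thresholds)
```

In this toy run the correctness and fairness tests pass while the robustness test fails, illustrating how two models with identical correlation can still differ in the share of tests they pass.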