Machine learning models for speech emotion recognition (SER) can be trained for different tasks and are usually evaluated on only a few available datasets per task. Tasks can include arousal, valence, dominance, emotional categories, or tone of voice. These models are mainly evaluated in terms of correlation or recall, and their predictions always contain some errors. These errors manifest themselves in model behaviour, which can differ considerably along different dimensions even when two models achieve the same recall or correlation. This paper investigates the behaviour of speech emotion recognition models with a testing framework that requires models to fulfil conditions in terms of correctness, fairness, and robustness.
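To illustrate how two models with the same recall can still behave differently, the following minimal sketch compares two hypothetical classifiers on toy data. It is not the testing framework investigated in this paper: the labels, the speaker groups `a` and `b`, the helper names `uar` and `uar_by_group`, and the threshold `MAX_GAP` are all assumptions made for illustration only.

```python
from collections import defaultdict


def uar(y_true, y_pred):
    """Unweighted average recall: the mean of the per-class recalls."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        hits[truth] += int(truth == pred)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def uar_by_group(y_true, y_pred, groups):
    """UAR computed separately for each (hypothetical) speaker group."""
    result = {}
    for group in sorted(set(groups)):
        idx = [i for i, g in enumerate(groups) if g == group]
        result[group] = uar([y_true[i] for i in idx], [y_pred[i] for i in idx])
    return result


# Toy data (invented for this sketch): four samples per speaker group.
y_true = ["happy", "happy", "sad", "sad"] * 2
groups = ["a"] * 4 + ["b"] * 4

# Model 1 makes one error in each group; model 2 makes both errors in group b.
pred_1 = ["happy", "sad", "sad", "sad", "happy", "happy", "sad", "happy"]
pred_2 = ["happy", "happy", "sad", "sad", "happy", "sad", "sad", "happy"]

MAX_GAP = 0.1  # hypothetical fairness threshold, not taken from the paper

for name, pred in [("model 1", pred_1), ("model 2", pred_2)]:
    per_group = uar_by_group(y_true, pred, groups)
    gap = max(per_group.values()) - min(per_group.values())
    verdict = "passed" if gap <= MAX_GAP else "failed"
    print(f"{name}: UAR={uar(y_true, pred):.2f}, per group={per_group}, "
          f"gap={gap:.2f}, fairness test {verdict}")
```

Both toy models reach the same overall recall, yet only the first one satisfies the assumed fairness condition. A single aggregate score cannot reveal this kind of difference in behaviour.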