Facial Expression Recognition is a commercially-important application, but one under-appreciated limitation is that such applications require making predictions on out-of-sample distributions, where target images have different properties from the images the model was trained on. How well -- or how badly -- do facial expression recognition models do on unseen target domains? We provide a systematic and critical evaluation of transfer learning -- specifically, domain generalization -- in facial expression recognition. Using a state-of-the-art model with twelve datasets (six collected in-lab and six ``in-the-wild"), we conduct extensive round-robin-style experiments to evaluate classification accuracies when given new data from an unseen dataset. We also perform multi-source experiments to examine a model's ability to generalize from multiple source datasets, including (i) within-setting (e.g., lab to lab), (ii) cross-setting (e.g., in-the-wild to lab), and (iii) leave-one-out settings. Finally, we compare our results with three commercially-available software. We find sobering results: the accuracy of single- and multi-source domain generalization is only modest. Even for the best-performing multi-source settings, we observe average classification accuracies of 65.6% (range: 34.6%-88.6%; chance: 14.3%), corresponding to an average drop of 10.8 percentage points from the within-corpus classification performance (mean: 76.4%). We discuss the need for regular, systematic investigations into the generalizability of affective computing models and applications.
翻译:面部表情识别是一项具有商业价值的重要应用,但一个常被低估的局限在于:此类应用需对分布外样本进行预测,即目标图像与模型训练所使用的图像具有不同属性。面部表情识别模型在未见过的目标域上表现如何(或有多差)?本文对面部表情识别中的迁移学习——特别是领域泛化——进行了系统性、批判性的评估。我们采用一个包含十二个数据集(六个实验室采集数据集和六个“野外”数据集)的最新模型,开展广泛的循环赛式实验,以评估模型在面对未见数据集中的新数据时的分类准确率。同时,我们通过多源实验考察模型从多个源数据集进行泛化的能力,包括:(i)同类型场景迁移(如实验室到实验室)、(ii)跨类型场景迁移(如野外到实验室)以及(iii)留一法迁移设置。最后,我们将结果与三款商业软件进行对比。研究结果令人警醒:单源与多源领域泛化的准确率仅属中等。即使在性能最佳的多源设置下,平均分类准确率也仅为65.6%(范围:34.6%-88.6%;随机基线:14.3%),较语料库内分类性能(平均76.4%)下降了10.8个百分点。我们呼吁对情感计算模型与应用的可泛化性进行常态化、系统性的检验。