Critically Examining the Domain Generalizability of Facial Expression Recognition Models

Facial Expression Recognition is a commercially important application, but one under-appreciated limitation is that such applications require making predictions on out-of-sample distributions, where target images have different properties from the images the model was trained on. How well, or how badly, do facial expression recognition models perform on unseen target domains? We provide a systematic and critical evaluation of transfer learning, specifically domain generalization, in facial expression recognition. Using a state-of-the-art model with twelve datasets (six collected in-lab and six "in-the-wild"), we conduct extensive round-robin-style experiments to evaluate classification accuracies when given new data from an unseen dataset. We also perform multi-source experiments to examine a model's ability to generalize from multiple source datasets, including (i) within-setting (e.g., lab to lab), (ii) cross-setting (e.g., in-the-wild to lab), and (iii) leave-one-out settings. Finally, we compare our results with three commercially available software packages. We find sobering results: the accuracy of single- and multi-source domain generalization is only modest. Even for the best-performing multi-source settings, we observe average classification accuracies of 65.6% (range: 34.6%-88.6%; chance: 14.3%), corresponding to an average drop of 10.8 percentage points from the within-corpus classification performance (mean: 76.4%). We discuss the need for regular, systematic investigations into the generalizability of affective computing models and applications.
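To make the evaluation protocol concrete, the following is a minimal sketch of how the source/target splits described above could be enumerated. It is one possible reading of the abstract, not the authors' code: the dataset names are hypothetical placeholders, the helper functions are invented for illustration, and the seven-class assumption is inferred only from the stated chance level of 14.3%.

```python
from itertools import permutations

# Hypothetical placeholder names; the abstract does not name the twelve datasets.
LAB = [f"lab_{i}" for i in range(1, 7)]
WILD = [f"wild_{i}" for i in range(1, 7)]
DATASETS = LAB + WILD


def round_robin_pairs(datasets):
    """Single-source setting: every ordered (source, target) pair, source != target."""
    return list(permutations(datasets, 2))


def leave_one_out_splits(datasets):
    """Multi-source setting: hold out one dataset as target, train on all the rest."""
    return [([d for d in datasets if d != target], target) for target in datasets]


def cross_setting_splits(sources, targets):
    """Multi-source setting: train on one whole group, test on each dataset of the other."""
    return [(list(sources), target) for target in targets]


pairs = round_robin_pairs(DATASETS)
loo = leave_one_out_splits(DATASETS)
within_lab = leave_one_out_splits(LAB)          # e.g., lab to lab
wild_to_lab = cross_setting_splits(WILD, LAB)   # e.g., in-the-wild to lab

# Sanity checks on the figures quoted in the abstract.
chance = 100 / 7     # assumption: seven expression classes
drop = 76.4 - 65.6   # within-corpus mean minus best multi-source mean

print(len(pairs), len(loo), round(chance, 1), round(drop, 1))
```

Under these assumptions, twelve datasets yield 132 ordered single-source pairs and 12 leave-one-out splits, and the reported 10.8-point drop is simply the difference between the two mean accuracies.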

