Fairness in deep learning models trained on high-dimensional inputs and subjective labels remains a complex and understudied area. In facial emotion recognition, datasets are often racially imbalanced, which can lead to models that yield disparate outcomes across racial groups. This study analyzes racial bias by sub-sampling training sets with varied racial distributions and assessing test performance across these simulations. Our findings indicate that, for smaller datasets with posed faces, both fairness and performance metrics improve as the simulations approach racial balance. Notably, the F1-score increases by $27.2$ percentage points and demographic parity increases by $15.7$ percentage points on average across the simulations. However, in larger datasets with greater facial variation, fairness metrics generally remain constant, suggesting that racial balance by itself is insufficient to achieve parity in test performance across different racial groups.
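As a rough illustration of the simulation setup described above, the following Python sketch sub-samples a training set to a target racial distribution and computes one common formulation of a demographic parity gap. It is a minimal sketch under stated assumptions, not the study's implementation: the function names `subsample_by_group` and `demographic_parity_gap`, the group labels, and the choice of the "largest difference in positive-prediction rate" as the parity measure are all hypothetical, and the exact metric definition used in the study is not given here.

```python
import random
from collections import defaultdict


def subsample_by_group(samples, group_of, proportions, total, seed=0):
    """Draw about `total` samples so each group's share matches `proportions`.

    `group_of` maps a sample to its group label (hypothetical helper).
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for s in samples:
        by_group[group_of(s)].append(s)

    subset = []
    for group, share in proportions.items():
        k = round(total * share)  # number of samples to draw from this group
        if k > len(by_group[group]):
            raise ValueError(f"not enough samples for group {group!r}")
        subset.extend(rng.sample(by_group[group], k))
    rng.shuffle(subset)
    return subset


def demographic_parity_gap(predictions, groups, positive_label):
    """Largest difference between groups in the rate of predicting `positive_label`.

    Assumed formulation: 0.0 means every group receives the label at the same rate.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [positive predictions, total]
    for pred, g in zip(predictions, groups):
        counts[g][0] += pred == positive_label
        counts[g][1] += 1
    rates = [pos / n for pos, n in counts.values()]
    return max(rates) - min(rates)


# Toy data: 80 samples from group "A" and 20 from group "B" (imbalanced).
data = [(f"img{i}", "A" if i < 80 else "B") for i in range(100)]

# One simulation step: a racially balanced subset of 40 samples.
balanced = subsample_by_group(data, lambda s: s[1], {"A": 0.5, "B": 0.5}, total=40)

# Parity gap for a toy set of predictions on four test samples.
gap = demographic_parity_gap(
    ["happy", "sad", "happy", "happy"], ["A", "A", "B", "B"], "happy"
)
```

Sweeping `proportions` from a skewed split toward an even one, retraining on each subset, and recording F1 and the parity measure on a fixed test set would reproduce the general shape of the experiment, though the training step itself is omitted here.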