Detecting human emotions from facial images in real-world scenarios is a difficult task due to low image quality, variations in lighting, pose changes, background distractions, small inter-class variations, noisy crowd-sourced labels, and severe class imbalance, as observed in the FER-2013 dataset of 48×48 grayscale images. Although recent approaches using large CNNs such as VGG and ResNet achieve reasonable accuracy, they are computationally expensive and memory-intensive, limiting their practicality for real-time applications. We address these challenges with a lightweight and efficient facial emotion recognition pipeline based on EfficientNetB2, trained using a two-stage warm-up and fine-tuning strategy. The model is enhanced with AdamW optimization (decoupled weight decay), label smoothing (epsilon = 0.06) to reduce annotation noise, and clipped class weights to mitigate class imbalance, along with dropout, mixed-precision training, and extensive real-time data augmentation. The model is trained on a stratified 87.5%/12.5% train-validation split while keeping the official test set intact, achieving a test accuracy of 68.78% with nearly ten times fewer parameters than VGG16-based baselines. Experimental results, including per-class metrics and learning dynamics, demonstrate stable training and strong generalization, making the proposed approach suitable for real-time and edge-based applications.
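Two of the techniques named above, label smoothing and clipped class weights, can be sketched concisely. The snippet below is a minimal illustration, not the paper's implementation: the smoothing follows the standard formula with the stated epsilon = 0.06, while the inverse-frequency weighting and the clipping range (0.5, 2.0) are assumptions for illustration, and the class counts are placeholder values, not the actual FER-2013 distribution.

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.06):
    # Standard label smoothing: keep (1 - eps) of the mass on the true class
    # and spread eps uniformly over all K classes.
    k = y_onehot.shape[-1]
    return y_onehot * (1.0 - eps) + eps / k

def clipped_class_weights(counts, clip=(0.5, 2.0)):
    # Inverse-frequency class weights, clipped so that very rare classes
    # (e.g. "disgust" in FER-2013) cannot dominate the loss.
    # The clip range here is an illustrative assumption.
    counts = np.asarray(counts, dtype=float)
    w = counts.sum() / (len(counts) * counts)
    return np.clip(w, *clip)

# FER-2013 has 7 emotion classes; these counts are placeholders.
counts = [4000, 450, 4100, 7200, 4800, 3200, 5000]
print(clipped_class_weights(counts))
```

Clipping the weights trades some re-balancing strength for stability: without it, a heavily underrepresented class would receive a very large weight and destabilize training.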