Emotion recognition from physiological signals has substantial potential for applications in mental health and emotion-aware systems. However, the lack of standardized, large-scale evaluations across heterogeneous datasets limits progress and model generalization. We introduce FEEL, the first large-scale benchmarking study of emotion recognition using electrodermal activity (EDA) and photoplethysmography (PPG) signals across 19 publicly available datasets. We evaluate 16 architectures spanning traditional machine learning, deep learning, and self-supervised pretraining approaches, grouped into four representative modeling paradigms. Our study includes both within-dataset and cross-dataset evaluations, analyzing generalization across variations in experimental settings, device types, and labeling strategies. Our results show that fine-tuned contrastive signal-language pretraining (CLSP) models achieve the highest F1 across arousal and valence classification tasks (71/114), while simpler models such as Random Forests, LDA, and MLP remain competitive (36/114). Models leveraging handcrafted features consistently outperform those trained on raw signal segments (107/114), underscoring the value of domain knowledge in low-resource, noisy settings. Further cross-dataset analyses reveal that models trained on data from real-life settings generalize well to lab (F1 = 0.79) and constraint-based settings (F1 = 0.78). Similarly, models trained on expert-annotated data transfer effectively to stimulus-labeled (F1 = 0.72) and self-reported datasets (F1 = 0.76). Moreover, models trained on data from lab-based devices transfer well to both custom wearable devices (F1 = 0.81) and the Empatica E4 (F1 = 0.73), underscoring the influence of heterogeneity. More information about FEEL is available on our website: https://alchemy18.github.io/FEEL_Benchmark/.