Wearable sensors are widely used to collect physiological data and develop stress detection models. However, most studies focus on a single dataset and rarely evaluate model reproducibility across devices, populations, or study conditions. We previously assessed the reproducibility of stress detection models across multiple studies, testing models trained on one dataset against others using heart rate (with R-R intervals) and electrodermal activity (EDA). In this study, we extended our reproducibility assessment to consumer wearable sensors. We compared validated research-grade devices (Biopac MP160, Polar H10, and Empatica E4) with a consumer wearable (Garmin Forerunner 55s) and assessed device-specific stress detection performance in a new stress study of undergraduate students. Thirty-five students completed three standardized stress-induction tasks in a lab setting. Biopac MP160 performed best, consistent with our expectation of it as the gold standard, though performance varied across devices and models. Combining heart rate variability (HRV) and EDA enhanced stress prediction in most scenarios. Empatica E4, however, showed variability: while combining HRV and EDA improved stress detection in leave-one-subject-out (LOSO) evaluations (AUROC up to 0.953), device-specific limitations led to underperformance when tested with our pre-trained stress detection tool (AUROC 0.723), highlighting generalizability challenges related to hardware-model compatibility. Garmin Forerunner 55s demonstrated strong potential for real-world stress monitoring, achieving the best mental arithmetic stress detection performance in LOSO (AUROC up to 0.961), comparable to research-grade devices such as Polar H10 (AUROC 0.954) and Empatica E4 (AUROC 0.905 with the HRV-only model and 0.953 with the HRV+EDA model), with the added advantage of consumer-friendly wearability in free-living contexts.
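For readers unfamiliar with the evaluation protocol, the leave-one-subject-out (LOSO) AUROC reported above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `auroc` and `loso_scores`, the single heart-rate-like feature, the toy data, and the midpoint-threshold scorer, which is only a stand-in and not the models used in the study. Pooling the held-out scores into one AUROC is also an assumption; the study may aggregate differently.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (stress, rest) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def loso_scores(samples):
    """Score every sample with a model fitted on all *other* subjects.

    samples: list of (subject_id, feature_value, label), label 1 = stress, 0 = rest.
    The "model" is a hypothetical stand-in: the signed distance of the feature
    from the midpoint between the training class means.
    """
    subjects = sorted({subject for subject, _, _ in samples})
    labels, scores = [], []
    for held_out in subjects:
        train = [(x, y) for s, x, y in samples if s != held_out]
        test = [(x, y) for s, x, y in samples if s == held_out]
        stress = [x for x, y in train if y == 1]
        rest = [x for x, y in train if y == 0]
        mean_stress = sum(stress) / len(stress)
        mean_rest = sum(rest) / len(rest)
        midpoint = (mean_stress + mean_rest) / 2
        # Orient the score so that larger values always mean "more stress-like".
        direction = 1.0 if mean_stress >= mean_rest else -1.0
        for x, y in test:
            labels.append(y)
            scores.append(direction * (x - midpoint))
    return labels, scores


# Toy data (invented): one rest and one stress heart-rate value per subject.
toy = [
    ("s1", 62, 0), ("s1", 88, 1),
    ("s2", 70, 0), ("s2", 95, 1),
    ("s3", 66, 0), ("s3", 90, 1),
]
toy_labels, toy_scores = loso_scores(toy)
print(f"Pooled LOSO AUROC on toy data: {auroc(toy_labels, toy_scores):.3f}")
```

The key property of LOSO is that no data from the held-out subject is ever used to fit the model that scores that subject, so the resulting AUROC reflects generalization to unseen people rather than to unseen samples from known people.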