Reliable stress recognition is critical in applications such as medical monitoring and safety-critical systems, including real-world driving. While stress is commonly detected from physiological signals such as perinasal perspiration and heart rate, facial activity provides complementary cues that can be captured unobtrusively from video. We propose a multimodal stress estimation framework that combines facial video with physiological signals and remains effective even when biosignal acquisition is challenging. Facial behavior is represented using a dense 3D Morphable Model, yielding a 56-dimensional descriptor that captures subtle expression and head-pose dynamics over time. To study how stress modulates facial motion, we perform extensive statistical analyses alongside established physiological markers. Paired hypothesis tests between baseline and stressor phases show that 38 of the 56 facial components exhibit consistent, phase-specific stress responses comparable to those of physiological markers. Building on these findings, we introduce a Transformer-based temporal modeling framework and evaluate unimodal, early-fusion, and cross-modal attention strategies. Cross-modal attention fusion of the 3D-derived facial features with physiological signals substantially improves performance over physiological signals alone, increasing AUROC from 52.7% to 92.0% and accuracy from 51.0% to 86.7%. Although evaluated on driving data, the proposed framework and protocol may generalize to other stress estimation settings.
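To make the per-component analysis concrete, below is a minimal sketch of paired baseline-versus-stressor testing over the 56 facial components, assuming one averaged value per subject per phase. The Wilcoxon signed-rank test and Holm correction are illustrative choices here; the abstract does not specify which paired test or multiple-comparison procedure the paper uses.

```python
# Sketch: paired tests over 56 facial components (assumed setup, not the
# paper's exact protocol).
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

def phase_specific_tests(baseline, stressor, alpha=0.05):
    """baseline, stressor: arrays of shape (n_subjects, 56) holding
    per-subject means of each 3DMM facial component in each phase."""
    pvals = np.array([
        wilcoxon(baseline[:, k], stressor[:, k]).pvalue
        for k in range(baseline.shape[1])
    ])
    # Correct for testing 56 components simultaneously (Holm step-down).
    reject, p_adj, _, _ = multipletests(pvals, alpha=alpha, method="holm")
    return reject, p_adj

# Example on synthetic data: 20 subjects x 56 components, with the
# stressor phase shifted relative to baseline.
rng = np.random.default_rng(0)
base = rng.normal(size=(20, 56))
stress = base + rng.normal(0.3, 1.0, size=(20, 56))
sig, p_adj = phase_specific_tests(base, stress)
print(f"{sig.sum()} of 56 components significant after correction")
```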
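The cross-modal attention fusion can likewise be sketched in a few lines of PyTorch. The snippet below assumes facial sequences of 56-dimensional 3DMM descriptors and low-dimensional physiological sequences (e.g., perinasal perspiration and heart rate), with each modality attending to the other before pooling and classification. The hidden sizes, head count, pooling, and classifier head are illustrative assumptions, not the paper's reported configuration.

```python
# Sketch: bidirectional cross-modal attention fusion (assumed
# dimensions and head; not the paper's exact architecture).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, face_dim=56, phys_dim=2, d_model=64, n_heads=4):
        super().__init__()
        self.face_proj = nn.Linear(face_dim, d_model)
        self.phys_proj = nn.Linear(phys_dim, d_model)
        # Each modality queries the other.
        self.face_to_phys = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.phys_to_face = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(2 * d_model, 1)  # binary stress logit

    def forward(self, face_seq, phys_seq):
        f = self.face_proj(face_seq)   # (B, T_face, d_model)
        p = self.phys_proj(phys_seq)   # (B, T_phys, d_model)
        f_att, _ = self.face_to_phys(query=f, key=p, value=p)
        p_att, _ = self.phys_to_face(query=p, key=f, value=f)
        # Mean-pool each attended stream over time, then classify.
        pooled = torch.cat([f_att.mean(dim=1), p_att.mean(dim=1)], dim=-1)
        return self.head(pooled).squeeze(-1)

model = CrossModalFusion()
logits = model(torch.randn(8, 120, 56), torch.randn(8, 120, 2))
print(logits.shape)  # torch.Size([8])
```

One design note: querying each modality with the other, rather than concatenating inputs (early fusion), lets the facial stream compensate when physiological channels are noisy or missing, which is consistent with the abstract's motivation of remaining effective when biosignal acquisition is challenging.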