In Virtual Reality (VR) and Human-Computer Interaction (HCI), real-time emotion recognition shows promise for helping individuals with Autism Spectrum Disorder (ASD) improve social skills. The task imposes a strict latency-accuracy trade-off: motion-to-photon (MTP) latency must stay below 140 ms to maintain contingency. However, most off-the-shelf Deep Learning models prioritize accuracy at the expense of the strict timing constraints of commodity hardware. As a first step toward accessible VR therapy, we benchmark State-of-the-Art (SOTA) models for Zero-Shot Facial Expression Recognition (FER) on virtual characters using the UIBVFED dataset. We evaluate Medium and Nano variants of YOLO (v8, v11, and v12) for face detection, alongside general-purpose Vision Transformers including CLIP, SigLIP, and ViT-FER. Our results on CPU-only inference demonstrate that while face detection on stylized avatars is robust (100% accuracy), a "Latency Wall" arises in the classification stage. The YOLOv11n architecture offers the best detection trade-off (~54 ms); however, general-purpose Transformers such as CLIP and SigLIP achieve neither viable accuracy (<23%) nor viable speed (>150 ms) for real-time loops. These findings underscore the need for lightweight, domain-specific architectures to enable accessible, real-time AI in therapeutic settings.
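The latency-budget reasoning above can be made concrete with a minimal benchmarking sketch. The 140 ms MTP budget and the ~54 ms detection figure come from the abstract; the `benchmark` and `fits_budget` helpers, the 20 ms render/compositing overhead, and the run counts are illustrative assumptions, not the study's actual harness.

```python
import time
import statistics

MTP_BUDGET_MS = 140.0  # motion-to-photon budget cited in the abstract


def benchmark(stage_fn, frames, warmup=3, runs=30):
    """Time one pipeline stage on CPU and return its median latency in ms.

    `stage_fn` stands in for a detection or classification call; a few
    warmup runs are discarded so model/cache initialization does not
    inflate the measurement.
    """
    for _ in range(warmup):
        stage_fn(frames[0])
    samples = []
    for i in range(runs):
        t0 = time.perf_counter()
        stage_fn(frames[i % len(frames)])
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples)


def fits_budget(detect_ms, classify_ms, overhead_ms=20.0):
    """Check whether detection + classification + an assumed fixed
    render/compositing overhead stays within the MTP budget."""
    return detect_ms + classify_ms + overhead_ms <= MTP_BUDGET_MS
```

Under these assumptions, YOLOv11n's ~54 ms detection leaves roughly 66 ms for classification, so a Transformer stage above ~150 ms overshoots the budget regardless of detector choice: `fits_budget(54.0, 150.0)` is `False`, while `fits_budget(54.0, 60.0)` is `True`.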