We present the early-stage design and implementation of a multimodal, real-time communication analysis system intended as a foundational interaction layer for adaptive VR training. The system integrates five parallel processing streams: (1) verbal and prosodic speech analysis, (2) skeletal gesture recognition from multi-view RGB cameras, (3) multimodal affective analysis combining lower-face video with upper-face facial EMG, (4) EEG-based mental state decoding, and (5) physiological arousal estimation from skin conductance, heart activity, and proxemic behavior. All signals are synchronized via Lab Streaming Layer to enable temporally aligned, continuous assessments of users' conscious and unconscious communication cues. Building on concepts from social semiotics and symbolic interactionism, we introduce an interpretation layer that links low-level signal representations to interactional constructs such as escalation and de-escalation. This layer is informed by domain knowledge from police instructors and lay participants, grounding system responses in realistic conflict scenarios. We demonstrate the feasibility and limitations of automated cue extraction in an XR-based de-escalation training project for law enforcement, reporting preliminary results for gesture recognition, emotion recognition under HMD occlusion, verbal assessment, mental state decoding, and physiological arousal. Our findings highlight the value of multi-view sensing and multimodal fusion for overcoming occlusion and viewpoint challenges, while underscoring that fusion and feedback must be treated as design problems rather than purely technical ones. The work contributes design resources and empirical insights for shaping human-AI-powered XR training in complex interpersonal settings.