Emergency Medical Services (EMS) responders often work under time pressure, cognitive overload, and inherent risk, conditions that demand critical thinking and rapid decision-making. This paper presents CognitiveEMS, an end-to-end wearable cognitive assistant system that can act as a collaborative virtual partner: it acquires and analyzes multimodal data from an emergency scene in real time and interacts with EMS responders through Augmented Reality (AR) smart glasses. CognitiveEMS processes continuous data streams in real time and leverages edge computing to assist with EMS protocol selection and intervention recognition. We address key technical challenges in real-time cognitive assistance by introducing three novel components: (i) a Speech Recognition model fine-tuned for real-world medical emergency conversations using simulated EMS audio recordings, augmented with synthetic data generated by large language models (LLMs); (ii) an EMS Protocol Prediction model that combines state-of-the-art (SOTA) tiny language models with EMS domain knowledge using graph-based attention mechanisms; and (iii) an EMS Action Recognition module that leverages multimodal audio and video data, together with protocol predictions, to infer the intervention/treatment actions taken by responders at the incident scene. On conversational data, our speech recognition model achieves a word error rate (WER) of 0.290, compared with 0.618 for SOTA. Our protocol prediction component also significantly outperforms SOTA (top-3 accuracy of 0.800 vs. 0.200), and action recognition achieves an accuracy of 0.727, while maintaining an end-to-end latency of 3.78 s for protocol prediction on the edge and 0.31 s on the server.
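For readers unfamiliar with the two headline metrics, the following is a minimal sketch of how word error rate and top-k accuracy are conventionally computed. It is not the authors' evaluation code: the function names, the whitespace tokenization, and the example transcript and protocol labels are all assumptions made for illustration.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumes simple whitespace tokenization; real evaluations usually
    normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    labels: Sequence[str],
    k: int = 3,
) -> float:
    """Fraction of cases whose true label is among the first k predictions."""
    if len(ranked_predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one labeled case is required")
    hits = sum(
        1
        for ranked, label in zip(ranked_predictions, labels)
        if label in ranked[:k]
    )
    return hits / len(labels)


if __name__ == "__main__":
    # Hypothetical transcript pair: one dropped word out of four.
    print(word_error_rate("patient is not breathing", "patient is breathing"))

    # Hypothetical protocol labels, ranked from most to least likely.
    ranked = [
        ["chest pain", "cardiac arrest", "stroke", "seizure"],
        ["seizure", "stroke", "overdose", "cardiac arrest"],
    ]
    print(top_k_accuracy(ranked, ["stroke", "cardiac arrest"], k=3))
```

Under these definitions, a lower WER is better (0.290 vs. 0.618), while a higher top-3 accuracy is better (0.800 vs. 0.200).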