Large language models (LLMs) are increasingly used in emergency first response (EFR) applications to support situational awareness (SA) and decision-making, yet most operate on text or 2D imagery and offer little support for core EFR SA competencies such as spatial reasoning. We address this gap by evaluating a prototype that fuses robot-mounted depth sensing and YOLO object detection with a vision language model (VLM) capable of verbalizing metrically grounded distances to detected objects (e.g., "the chair is 3.02 meters away"). In a mixed-reality toxic-smoke scenario, participants estimated distances to a victim and an exit window under three conditions: video-only, depth-agnostic VLM, and depth-augmented VLM. Depth augmentation improved objective accuracy and stability (distance-estimation errors for both the victim and the window dropped) while raising situational awareness without increasing workload. Conversely, depth-agnostic assistance increased workload and slightly worsened accuracy. We contribute to human SA augmentation by demonstrating that metrically grounded, object-centric verbal information supports spatial reasoning in EFR and improves decision-relevant judgments under time pressure.