One mistake by an AI system in a safety-critical setting can cost lives. As Large Language Models (LLMs) become integral to robotics decision-making, the physical dimension of risk grows; a single wrong instruction can directly endanger human safety. This paper addresses the urgent need to systematically evaluate LLM performance in scenarios where even minor errors are catastrophic. Through a qualitative evaluation of a fire evacuation scenario, we identified critical failure cases in LLM-based decision-making. Based on these, we designed seven tasks for quantitative assessment, categorized into: Complete Information, Incomplete Information, and Safety-Oriented Spatial Reasoning (SOSR). Complete information tasks utilize ASCII maps to minimize interpretation ambiguity and isolate spatial reasoning from visual processing. Incomplete information tasks require models to infer missing context, testing for spatial continuity versus hallucinations. SOSR tasks use natural language to evaluate safe decision-making in life-threatening contexts. We benchmark various LLMs and Vision-Language Models (VLMs) across these tasks. Beyond aggregate performance, we analyze the implications of a 1% failure rate, highlighting how "rare" errors escalate into catastrophic outcomes. Results reveal serious vulnerabilities: several models achieved a 0% success rate in ASCII navigation, while in a simulated fire drill, models instructed robots to move toward hazardous areas instead of emergency exits. Our findings lead to a sobering conclusion: current LLMs are not ready for direct deployment in safety-critical systems. A 99% accuracy rate is dangerously misleading in robotics, as it implies one out of every hundred executions could result in catastrophic harm. We demonstrate that even state-of-the-art models cannot guarantee safety, and absolute reliance on them creates unacceptable risks.
翻译:在安全关键场景中,人工智能系统的一次失误就可能付出生命的代价。随着大语言模型(LLMs)日益融入机器人决策,风险的物理维度也随之扩大;一条错误的指令可能直接危及人身安全。本文旨在应对迫切需求,系统性地评估大语言模型在即使微小错误也可能导致灾难性后果的场景中的表现。通过对火灾疏散场景的定性评估,我们识别了基于大语言模型的决策中的关键失效案例。基于此,我们设计了七项任务进行定量评估,并将其分类为:完全信息任务、不完全信息任务以及面向安全的空间推理(SOSR)任务。完全信息任务使用ASCII地图来最小化解读歧义,并将空间推理与视觉处理隔离开来。不完全信息任务要求模型推断缺失的上下文,以测试空间连续性与幻觉。SOSR任务使用自然语言来评估在危及生命的背景下做出安全决策的能力。我们在这些任务上对多种大语言模型和视觉语言模型(VLMs)进行了基准测试。除了总体性能,我们还分析了1%失败率的影响,揭示了“罕见”错误如何升级为灾难性后果。结果揭示了严重的脆弱性:多个模型在ASCII导航任务中取得了0%的成功率;而在模拟消防演习中,模型指示机器人向危险区域而非紧急出口移动。我们的发现得出了一个发人深省的结论:当前的大语言模型尚未准备好直接部署于安全关键系统。在机器人领域,99%的准确率具有危险的误导性,因为它意味着每百次执行中就可能有一次导致灾难性伤害。我们证明,即使是最先进的模型也无法保证安全,绝对依赖它们会带来不可接受的风险。