Safety Not Found (404): Hidden Risks of LLM-Based Robotics Decision Making

One mistake by an AI system in a safety-critical setting can cost lives. As Large Language Models (LLMs) become integral to robotics decision-making, the physical dimension of risk grows; a single wrong instruction can directly endanger human safety. This paper addresses the urgent need to systematically evaluate LLM performance in scenarios where even minor errors are catastrophic. Through a qualitative evaluation of a fire evacuation scenario, we identified critical failure cases in LLM-based decision-making. Based on these, we designed seven tasks for quantitative assessment, grouped into three categories: Complete Information, Incomplete Information, and Safety-Oriented Spatial Reasoning (SOSR). Complete Information tasks use ASCII maps to minimize interpretation ambiguity and isolate spatial reasoning from visual processing. Incomplete Information tasks require models to infer missing context, testing whether they preserve spatial continuity or hallucinate. SOSR tasks use natural language to evaluate safe decision-making in life-threatening contexts. We benchmark various LLMs and Vision-Language Models (VLMs) across these tasks. Beyond aggregate performance, we analyze the implications of a 1% failure rate, highlighting how "rare" errors escalate into catastrophic outcomes. Results reveal serious vulnerabilities: several models achieved a 0% success rate in ASCII navigation, while in a simulated fire drill, models instructed robots to move toward hazardous areas instead of emergency exits. Our findings lead to a sobering conclusion: current LLMs are not ready for direct deployment in safety-critical systems. A 99% accuracy rate is dangerously misleading in robotics, as it implies that one out of every hundred executions could result in catastrophic harm. We demonstrate that even state-of-the-art models cannot guarantee safety, and absolute reliance on them creates unacceptable risks.
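
To make the 1% figure concrete, the following is a minimal sketch of how a per-execution failure rate compounds over repeated executions. It assumes that each execution fails independently with a fixed probability, which is a simplification introduced here and not a claim from the paper. The function names are hypothetical and chosen only for illustration.

```python
import math


def prob_at_least_one_failure(p_fail: float, n_runs: int) -> float:
    """Probability of one or more failures in n_runs executions.

    Assumes independent executions with a fixed per-run failure
    probability p_fail (an illustrative assumption).
    """
    return 1.0 - (1.0 - p_fail) ** n_runs


def runs_until_risk(p_fail: float, risk: float) -> int:
    """Smallest number of runs at which the cumulative failure
    probability reaches or exceeds `risk`, under the same assumption."""
    return math.ceil(math.log(1.0 - risk) / math.log(1.0 - p_fail))


if __name__ == "__main__":
    # A 99%-accurate model corresponds to p_fail = 0.01.
    for n in (1, 10, 100, 1000):
        print(n, round(prob_at_least_one_failure(0.01, n), 3))
    print(runs_until_risk(0.01, 0.5))
```

Under this assumption, a model with a 1% failure rate has roughly a 63% chance of at least one failure within 100 executions, and the chance passes 50% after 69 executions. Real deployments may have correlated failures, so these numbers are only a rough guide to how quickly "rare" errors accumulate.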
