Multimodal Large Language Models are increasingly deployed as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, a benchmark built upon the embodied agent benchmark ALFRED and augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition in disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families not only on hazard recognition but also on active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, their average mitigation success rates for these hazards are low in comparison. These findings demonstrate that static QA evaluations are insufficient for assessing physical safety, and we therefore advocate a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open-source our code and dataset at https://github.com/sled-group/SafetyALFRED.git