Artificial Intelligence (AI) is revolutionizing scientific research, yet its growing integration into laboratory environments presents critical safety challenges. Large language models (LLMs) and vision-language models (VLMs) now assist in experiment design and procedural guidance, but their "illusion of understanding" may lead researchers to overtrust unsafe outputs. Here we show that current models remain far from the reliability needed for safe laboratory operation. We introduce LabSafety Bench, a comprehensive benchmark that evaluates models on hazard identification, risk assessment, and consequence prediction across 765 multiple-choice questions and 404 realistic lab scenarios encompassing 3,128 open-ended tasks. Evaluations of 19 advanced LLMs and VLMs show that no model surpasses 70% accuracy on hazard identification. While proprietary models perform well on structured assessments, they show no clear advantage in open-ended reasoning. These results underscore the urgent need for specialized safety evaluation frameworks before AI systems are deployed in real laboratory settings.