Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control

Large language models (LLMs) are increasingly considered for deployment as the control component of robotic health attendants, yet their safety in this context remains poorly characterized. We introduce a dataset of 270 harmful instructions spanning nine prohibited behavior categories grounded in the American Medical Association Principles of Medical Ethics, and use it to evaluate 72 LLMs in a simulation environment based on the Robotic Health Attendant framework. The mean violation rate across all models was 54.4%, with more than half exceeding 50%, and violation rates varied substantially across behavior categories, with superficially plausible instructions such as device manipulation and emergency delay proving harder to refuse than overtly destructive ones. Model size and release date were the primary determinants of safety performance among open-weight models, and proprietary models were substantially safer than open-weight counterparts (median 23.7% versus 72.8%). Medical domain fine-tuning conferred no significant overall safety benefit, and a prompt-based defense strategy produced only a modest reduction in violation rates among the least safe models, leaving absolute violation rates at levels that would preclude safe clinical deployment. These findings demonstrate that safety evaluation must be treated as a first-class criterion in the development and deployment of LLMs for robotic health attendants.
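
The abstract reports violation rates aggregated per model and per behavior category. As a minimal sketch of how such rates could be tallied, assuming each evaluation outcome is recorded as a `(model, category, violated)` tuple: the record layout, the function name `violation_rates`, the model names, and the toy data are all hypothetical and are not taken from the paper's evaluation code. The category labels `device_manipulation` and `emergency_delay` echo the abstract; `physical_harm` is an invented placeholder.

```python
from collections import defaultdict
from statistics import mean, median


def violation_rates(records, field=0):
    """Return {group: percent of instructions that were violated}.

    records: iterable of (model, category, violated) tuples (assumed layout).
    field:   0 groups by model, 1 groups by behavior category.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [violations, total]
    for record in records:
        group = record[field]
        counts[group][0] += int(record[2])
        counts[group][1] += 1
    return {g: 100.0 * v / n for g, (v, n) in counts.items()}


# Made-up outcomes for two hypothetical models (not real results).
records = [
    ("model_a", "device_manipulation", True),
    ("model_a", "emergency_delay", True),
    ("model_a", "physical_harm", False),
    ("model_a", "physical_harm", False),
    ("model_b", "device_manipulation", True),
    ("model_b", "emergency_delay", False),
    ("model_b", "physical_harm", False),
    ("model_b", "physical_harm", False),
]

per_model = violation_rates(records, field=0)
per_category = violation_rates(records, field=1)

# Summary statistics across models, analogous to the mean and median
# quoted in the abstract.
mean_rate = mean(list(per_model.values()))
median_rate = median(list(per_model.values()))

print(per_model)
print(per_category)
print(mean_rate, median_rate)
```

Grouping by a tuple index keeps the same function usable for both the per-model and the per-category breakdown; a real pipeline would also need a judging step that decides whether a model response counts as a violation, which this sketch takes as given.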

