The recent introduction of large language models (LLMs) has revolutionized the field of robotics by enabling contextual reasoning and intuitive human-robot interaction in domains as varied as manipulation, locomotion, and self-driving vehicles. Viewed as a stand-alone technology, LLMs are known to be vulnerable to jailbreaking attacks, wherein malicious prompters elicit harmful text by bypassing LLM safety guardrails. To assess the risks of deploying LLMs in robotics, in this paper we introduce RoboPAIR, the first algorithm designed to jailbreak LLM-controlled robots. Unlike existing textual attacks on LLM chatbots, RoboPAIR elicits harmful physical actions from LLM-controlled robots, a phenomenon we experimentally demonstrate in three scenarios: (i) a white-box setting, wherein the attacker has full access to the NVIDIA Dolphins self-driving LLM, (ii) a gray-box setting, wherein the attacker has partial access to a Clearpath Robotics Jackal UGV robot equipped with a GPT-4o planner, and (iii) a black-box setting, wherein the attacker has only query access to the GPT-3.5-integrated Unitree Robotics Go2 robot dog. In each scenario, and across three new datasets of harmful robotic actions, we demonstrate that RoboPAIR, as well as several static baselines, finds jailbreaks quickly and effectively, often achieving 100% attack success rates. Our results reveal, for the first time, that the risks of jailbroken LLMs extend far beyond text generation, given the distinct possibility that jailbroken robots could cause physical damage in the real world. Indeed, our results on the Unitree Go2 represent the first successful jailbreak of a deployed commercial robotic system. Addressing this emerging vulnerability is critical for ensuring the safe deployment of LLMs in robotics. Additional media is available at: https://robopair.org