Large Language Models (LLMs) have shown significant promise in real-world decision-making tasks for embodied artificial intelligence, especially when fine-tuned to leverage their inherent common sense and reasoning abilities while being tailored to specific applications. However, this fine-tuning process introduces considerable safety and security vulnerabilities, particularly in safety-critical cyber-physical systems. In this work, we propose the first comprehensive framework for Backdoor Attacks against LLM-based Decision-making systems (BALD) in embodied AI, systematically exploring the attack surfaces and trigger mechanisms. Specifically, we propose three distinct attack mechanisms: word injection, scenario manipulation, and knowledge injection, each targeting a different component of the LLM-based decision-making pipeline. We conduct extensive experiments with representative LLMs (GPT-3.5, LLaMA2, PaLM2) on autonomous driving and home robot tasks, demonstrating the effectiveness and stealthiness of our backdoor triggers across these attack channels, with consequences such as a vehicle accelerating toward an obstacle or a robot placing a knife on a bed. Our word and knowledge injection attacks achieve nearly 100% success rates across multiple models and datasets while requiring only limited access to the system. Our scenario manipulation attack achieves success rates exceeding 65%, and up to 90%, without any runtime system intrusion. We also assess the robustness of these attacks against existing defenses and find that they remain resilient. Our findings highlight critical security vulnerabilities in embodied LLM systems and underscore the urgent need to safeguard these systems against potential risks.