With the rise of large language models (LLMs), powerful LLM-based intelligent agents have been developed to provide customized services using a set of user-defined tools. State-of-the-art methods for constructing LLM agents take trained LLMs and further fine-tune them on data for the agent task. However, we show that such methods are vulnerable to our proposed backdoor attacks, named BadAgent, on various agent tasks, where a backdoor can be embedded by fine-tuning on backdoor data. At test time, the attacker can manipulate the deployed LLM agent into executing harmful operations by presenting the trigger in the agent input or environment. To our surprise, our proposed attack methods remain extremely robust even after fine-tuning on trustworthy data. Although backdoor attacks have been studied extensively in natural language processing, to the best of our knowledge we may be the first to study them on LLM agents, which are more dangerous because they are permitted to use external tools. Our work demonstrates the clear risk of constructing LLM agents based on untrusted LLMs or data. Our code is public at https://github.com/DPamK/BadAgent