Open-source Large Language Models (LLMs) have recently gained popularity because their performance is comparable to that of proprietary LLMs. To efficiently fulfill domain-specialized tasks, open-source LLMs can be refined, without expensive accelerators, using low-rank adapters. However, it is still unknown whether low-rank adapters can be exploited to control LLMs. To address this gap, we demonstrate that an infected adapter can induce an LLM, on specific triggers, to output content defined by an adversary and even to maliciously use tools. To train a Trojan adapter, we propose two novel attacks, POLISHED and FUSION, that improve over prior approaches. POLISHED uses LLM-enhanced paraphrasing to polish benchmark poisoned datasets. In contrast, in the absence of a dataset, FUSION leverages an over-poisoning procedure to transform a benign adapter. Our experiments validate that our attacks provide higher attack effectiveness than the baseline and, for the purpose of attracting downloads, preserve or improve the adapter's utility. Finally, we provide two case studies to demonstrate that the Trojan adapter can lead an LLM-powered autonomous agent to execute unintended scripts or send phishing emails. Our work represents the first study of supply chain threats for LLMs through the lens of Trojan plugins.