Large language models (LLMs) are often fine-tuned on uncurated text datasets that adversaries can poison. Existing poisoning attacks rely primarily on fixed trigger phrases, which defenses such as outlier detection, clean-data regularization, or online monitoring can neutralize. In this paper, we propose a data poisoning method that reliably and stealthily teaches an LLM an information-hiding scheme through semantic associations between shared knowledge (such as facts or concepts) and attacker-chosen phrases. The induced hiding scheme can encode and decode arbitrary malicious instructions, revealing a new and subtle poisoning-induced vulnerability: covert control attacks. We precisely characterize covert control attacks and evaluate them across $5$ LLMs, $3$ backdoor defenses, and $4$ prompt injection defenses. With a small poisoned fraction, covert control attacks outperform heuristic-based prompt injection attacks in average attack success rate by about $40\%$ relative to clean fine-tuned models. They also circumvent defenses based on detection and fine-tuning, maintaining up to $93\%$ attack success rate after backdoor defenses and up to $98\%$ after prompt injection defenses.