Large language models (LLMs) have been widely adopted in applications ranging from automated content generation to critical decision-making systems. However, the risk of prompt injection allows for potential manipulation of LLM outputs. While numerous attack methods have been documented, achieving full control over these outputs remains challenging: it often requires experienced attackers to make multiple attempts and depends heavily on the prompt context. Recent advances in gradient-based white-box attack techniques have shown promise in tasks such as jailbreaks and system prompt leaks. Our research generalizes gradient-based attacks to find a trigger that is (1) Universal: effective irrespective of the target output; (2) Context-Independent: robust across diverse prompt contexts; and (3) Precise Output: capable of manipulating LLM inputs to yield any specified output with high accuracy. We propose a novel method to efficiently discover such triggers and assess the effectiveness of the proposed attack. Furthermore, we discuss the substantial threats such attacks pose to LLM-based applications, highlighting the potential for adversaries to take over the decisions and actions made by AI agents.