Towards Action Hijacking of Large Language Model-based Agent

In the past few years, intelligent agents powered by large language models (LLMs) have made remarkable progress in performing complex tasks. These LLM-based agents receive queries as tasks and use their LLMs to decompose them into subtasks, which guide the actions of external entities (\eg{}, tools, AI agents) to answer users' questions. Thanks to their exceptional understanding and problem-solving capabilities, they are widely adopted in labor-intensive sectors including healthcare, finance, code completion, \etc{} At the same time, concerns about the potential misuse of these agents have prompted service providers to build in safety guards. To circumvent these built-in guidelines, prior studies have proposed a multitude of attacks, including memory poisoning, jailbreaking, and prompt injection. These attacks often fail to remain effective against the safety filters employed by agents because of their restricted privileges and the harmful semantics of their queries. In this paper, we introduce \Name, a novel hijacking attack that manipulates the action plans of black-box agent systems. \Name first collects action-aware memory through prompt theft from the agent's long-term memory. It then leverages the agent's internal memory retrieval mechanism to supply an erroneous context. The large gap between the latent spaces of the retriever and the safety filters allows our method to bypass detection easily. Extensive experimental results demonstrate the effectiveness of our approach (\eg{}, a 99.67\% attack success rate). In addition, our approach achieves an average bypass rate of 92.7\% against safety filters.
