LLM-based agents execute real-world workflows using tools and memory. The same affordances let malicious adversaries repurpose these agents for complex misuse scenarios. Existing agent-misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents come to assist with harmful or illegal tasks over multiple turns. We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling tools such as discovery curves, hazard-ratio attribution by attack language, and a new metric, Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines adapted to tool-using agents. In multilingual evaluations across six non-English settings, attack success and illicit-task completion do not consistently increase in lower-resource languages, diverging from common chatbot findings. Overall, STING offers a practical way to evaluate and stress-test agent misuse in realistic deployment settings, where interactions are inherently multi-turn and often multilingual.
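The survival-analysis framing suggests a simple empirical form for the discovery curve and the restricted-mean metric. The sketch below is an illustrative assumption rather than the paper's exact definition: it treats each scenario's first-jailbreak turn as the event time (infinite if no jailbreak occurs within the turn budget) and takes Restricted Mean Jailbreak Discovery as the normalized area under the discovery curve up to that budget. All function names, the normalization choice, and the example data are hypothetical.

```python
import numpy as np

def discovery_curve(first_jailbreak_turns, horizon):
    """Empirical discovery curve D(t): fraction of scenarios whose first
    jailbreak occurs at or before turn t. Scenarios never jailbroken within
    the budget are encoded as np.inf and never counted as discovered."""
    turns = np.asarray(first_jailbreak_turns, dtype=float)
    return np.array([(turns <= t).mean() for t in range(1, horizon + 1)])

def restricted_mean_jailbreak_discovery(first_jailbreak_turns, horizon):
    """Normalized area under the discovery curve up to the turn budget,
    analogous to restricted mean survival time; higher values mean
    jailbreaks are found earlier and more often within the budget."""
    curve = discovery_curve(first_jailbreak_turns, horizon)
    return curve.sum() / horizon  # in [0, 1]

if __name__ == "__main__":
    # Made-up first-jailbreak turns for six scenarios (np.inf = no jailbreak found)
    turns = [2, 5, np.inf, 3, np.inf, 1]
    print(restricted_mean_jailbreak_discovery(turns, horizon=8))  # ~0.52
```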