Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

LLM-based agents execute real-world workflows via tools and memory. These same affordances let adversaries use the agents to carry out complex misuse scenarios. Existing agent misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents come to assist with harmful or illegal tasks over multiple turns. We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling analysis tools such as discovery curves, hazard-ratio attribution by attack language, and a new metric, Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit-task completion than both single-turn prompting and chat-oriented multi-turn baselines adapted to tool-using agents. In multilingual evaluations across six non-English settings, we find that attack success and illicit-task completion do not consistently increase in lower-resource languages, diverging from common chatbot findings. Overall, STING provides a practical way to evaluate and stress-test agent misuse in realistic deployment settings, where interactions are inherently multi-turn and often multilingual.

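To make the time-to-first-jailbreak framing concrete, the following is a minimal sketch of how a discovery curve and a restricted mean could be estimated from red-teaming runs. It is not the authors' implementation. It assumes a standard Kaplan-Meier estimator over discrete turns, with runs that end without a jailbreak treated as right-censored. The abstract does not define Restricted Mean Jailbreak Discovery, so `restricted_mean_turns` computes only the textbook restricted mean survival time analogue (expected turns to first jailbreak, truncated at the horizon); the function names and input layout are hypothetical.

```python
from collections import Counter


def discovery_curve(turns, jailbroken, horizon):
    """Kaplan-Meier estimate of P(first jailbreak <= t) for t = 1..horizon.

    turns[i]      -- turn of the first jailbreak in run i, or the last turn
                     observed if no jailbreak occurred (hypothetical layout)
    jailbroken[i] -- True if a jailbreak was observed, False if censored
    """
    events = Counter(t for t, j in zip(turns, jailbroken) if j)
    exits = Counter(turns)  # every run leaves the risk set at its recorded turn
    at_risk = len(turns)
    survival = 1.0
    curve = []
    for t in range(1, horizon + 1):
        if at_risk > 0:
            # Discrete hazard at turn t: jailbreaks / runs still at risk.
            survival *= 1.0 - events[t] / at_risk
        curve.append(1.0 - survival)
        at_risk -= exits[t]
    return curve


def restricted_mean_turns(curve):
    """E[min(T, horizon)] = sum of S(t) for t = 0..horizon-1, with S(0) = 1.

    Assumption: this is the standard restricted mean survival time, used
    here as a stand-in because the abstract does not define the metric.
    """
    survival = [1.0] + [1.0 - d for d in curve[:-1]]
    return sum(survival)


if __name__ == "__main__":
    # Five illustrative runs with a 3-turn budget (made-up data).
    turns = [1, 2, 2, 3, 3]
    jailbroken = [True, True, False, True, False]
    curve = discovery_curve(turns, jailbroken, horizon=3)
    print("discovery curve:", curve)
    print("restricted mean turns:", restricted_mean_turns(curve))
```

Under this reading, a lower restricted mean indicates that jailbreaks are discovered in fewer turns. Hazard-ratio attribution by attack language would additionally require a regression model such as Cox proportional hazards, which this sketch does not cover.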