The evolution of large language models into autonomous agents introduces adversarial failures that exploit legitimate tool privileges, transforming safety evaluation in tool-augmented environments from a subjective NLP task into an objective control problem. We formalize this threat model as Tag-Along Attacks: a tool-less adversary "tags along" on the trusted privileges of a safety-aligned Operator to induce prohibited tool use through conversation alone. To validate this threat, we present Slingshot, a "cold-start" reinforcement learning framework that autonomously discovers emergent attack vectors, revealing a critical insight: in our setting, learned attacks tend to converge to short, instruction-like syntactic patterns rather than multi-turn persuasion. On held-out extreme-difficulty tasks, Slingshot achieves a 67.0% success rate against a Qwen2.5-32B-Instruct-AWQ Operator (vs. a 1.7% baseline), reducing the expected number of attempts to first success (on solved tasks) from 52.3 to 1.3. Crucially, Slingshot transfers zero-shot across several model families, including closed-source models such as Gemini 2.5 Flash (56.0% attack success rate) and defensively fine-tuned open-source models such as Meta-SecAlign-8B (39.2% attack success rate). Our work establishes Tag-Along Attacks as a first-class, verifiable threat model and shows that effective agentic attacks can be elicited from off-the-shelf open-weight models through environment interaction alone.