As Large Language Models (LLMs) evolve from static chatbots into autonomous agents capable of tool execution, the landscape of AI safety is shifting from content moderation to action security. However, existing red-teaming frameworks remain bifurcated: they either focus on rigid, script-based text attacks or lack the architectural modularity to simulate complex, multi-turn agentic exploits. In this paper, we introduce AJAR (Adaptive Jailbreak Architecture for Red-teaming), a proof-of-concept framework designed to bridge this gap through protocol-driven cognitive orchestration. Built upon the robust runtime of Petri, AJAR leverages the Model Context Protocol (MCP) to decouple adversarial logic from the execution loop, encapsulating state-of-the-art algorithms such as X-Teaming as standardized, plug-and-play services. We validate the architectural feasibility of AJAR through a controlled qualitative case study, demonstrating its ability to perform stateful backtracking within a tool-use environment. Furthermore, our preliminary exploration of the "Agentic Gap" reveals a complex safety dynamic: while tool usage introduces new injection vectors via code execution, the cognitive load of parameter formatting can inadvertently disrupt persona-based attacks. AJAR is open-sourced to facilitate the standardized, environment-aware evaluation of this emerging attack surface. The code and data are available at https://github.com/douyipu/ajar.