Large language models are increasingly deployed as *deep agents* that plan, maintain persistent state, and invoke external tools, shifting safety failures from unsafe text to unsafe *trajectories*. We introduce **AgentFence**, an architecture-centric security evaluation that defines 14 trust-boundary attack classes spanning planning, memory, retrieval, tool use, and delegation, and detects failures via *trace-auditable conversation breaks* (unauthorized or unsafe tool use, wrong-principal actions, state/objective integrity violations, and attack-linked deviations). Holding the base model fixed, we evaluate eight agent archetypes under persistent multi-turn interaction and observe substantial architectural variation in mean security break rate (MSBR), ranging from $0.29 \pm 0.04$ (LangGraph) to $0.51 \pm 0.07$ (AutoGPT). The highest-risk classes are operational: Denial-of-Wallet ($0.62 \pm 0.08$), Authorization Confusion ($0.54 \pm 0.10$), Retrieval Poisoning ($0.47 \pm 0.09$), and Planning Manipulation ($0.44 \pm 0.11$), while prompt-centric classes remain below $0.20$ under standard settings. Breaks are dominated by boundary violations (SIV 31%, WPA 27%, UTI+UTA 24%, ATD 18%), and authorization confusion correlates with objective hijacking and tool hijacking ($\rho \approx 0.63$ and $\rho \approx 0.58$, respectively). AgentFence reframes agent security around what matters operationally: whether an agent stays within its goal and authority envelope over time.
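The abstract does not spell out how MSBR is computed, so the following is only a minimal sketch under one plausible reading: for each archetype, take the fraction of trials in each attack class that contain at least one break, then average those per-class rates. The record layout `(archetype, attack_class, broke)` and the function names `break_rates` and `msbr` are hypothetical and are not taken from AgentFence.

```python
from collections import defaultdict
from statistics import mean


def break_rates(records):
    """Fraction of trials with a break, per (archetype, attack_class) cell.

    `records` is an iterable of (archetype, attack_class, broke) tuples,
    an assumed layout chosen for illustration only.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [breaks, trials]
    for archetype, attack_class, broke in records:
        cell = counts[(archetype, attack_class)]
        cell[0] += int(broke)
        cell[1] += 1
    return {key: breaks / trials for key, (breaks, trials) in counts.items()}


def msbr(records):
    """Assumed MSBR: unweighted mean of per-class break rates per archetype."""
    per_archetype = defaultdict(list)
    for (archetype, _attack_class), rate in break_rates(records).items():
        per_archetype[archetype].append(rate)
    return {archetype: mean(rates) for archetype, rates in per_archetype.items()}


# Toy data, invented purely to exercise the functions.
demo = [
    ("A", "retrieval_poisoning", True),
    ("A", "retrieval_poisoning", False),
    ("A", "denial_of_wallet", True),
    ("B", "retrieval_poisoning", False),
]
scores = msbr(demo)
```

Averaging per-class rates rather than pooling all trials keeps an attack class with many trials from dominating the score; whether the evaluation weights classes this way is an assumption here.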