The performance of large language model (LLM) agents depends critically on the execution harness, the system layer that orchestrates tool use, context management, and state persistence. Yet this architectural centrality also makes the harness a high-value attack surface: a single compromise at the harness level can cascade through the entire execution pipeline. We observe that existing security approaches suffer from a structural mismatch that leaves them blind to harness-internal state and unable to coordinate across the different phases of agent operation. In this paper, we introduce \safeharness{}, a security architecture that weaves four defense layers directly into the agent lifecycle to address these limitations: adversarial context filtering at input processing, tiered causal verification at decision making, privilege-separated tool control at action execution, and safe rollback with adaptive degradation at state update. Cross-layer mechanisms tie these layers together, escalating verification rigor, triggering rollbacks, and tightening tool privileges whenever sustained anomalies are detected. We evaluate \safeharness{} on benchmark datasets across diverse harness configurations, comparing against four security baselines under five attack scenarios spanning six threat categories. Compared to the unprotected baseline, \safeharness{} reduces the unsafe behavior rate (UBR) by approximately 38\% and the attack success rate (ASR) by approximately 42\% on average, while preserving core task utility.