LLM agents do not act on raw interaction history; they act on a bounded decision state assembled by truncation, summarization, reordering, and rewriting. If directive-bearing state is dropped, weakened, or rebound during that step, an agent can cross a policy boundary without prompt override, model changes, or persistent-memory compromise. We study this failure mode over local Llama 3.1 8B, Qwen 2.5 7B, and Mistral 7B using judged exact constraint respect and direct audits of assembled-state visibility. We evaluate SafeContext, a control layer that pins control state, reuses retained control prefixes, and optionally injects reminders under pressure while keeping model weights fixed. Unmitigated risk is systematic, but absolute exact compliance remains low. Against truncation, SafeContext yields small gains; against a strong structured-compaction policy, most aggregate lift disappears, leaving residual benefit mainly in overflow eviction and selected aliasing slices. Replay-only does not explain the effect. A larger-model extension on Qwen 14B and Llama 70B shows the same failure object under larger models, although sign and magnitude remain policy-conditional. Decision-time context assembly is therefore a measurable part of the control path that can be partially hardened.
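To make the failure mode concrete, the following is a minimal sketch of how recency-based truncation can drop a directive from the assembled decision state, and how pinning control state keeps it visible. It is not the SafeContext implementation: the `Message` type, the `"control"` role label, the word-count token proxy, and all function names are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    role: str  # hypothetical label: "control" marks directive-bearing state
    text: str


def cost(msg: Message) -> int:
    """Crude token proxy (assumption): whitespace-separated word count."""
    return len(msg.text.split())


def assemble_truncated(history, budget):
    """Keep the most recent messages that fit the budget; older ones are dropped."""
    kept, used = [], 0
    for msg in reversed(history):
        if used + cost(msg) > budget:
            break
        kept.append(msg)
        used += cost(msg)
    return list(reversed(kept))


def assemble_pinned(history, budget):
    """Reserve budget for control messages first, then fill with recent history."""
    control = [m for m in history if m.role == "control"]
    remaining = budget - sum(cost(m) for m in control)
    if remaining < 0:
        raise ValueError("control state alone exceeds the budget")
    rest = assemble_truncated(
        [m for m in history if m.role != "control"], remaining
    )
    return control + rest


def directive_visible(state, directive):
    """Audit of the assembled state: is the directive still present?"""
    return any(m.role == "control" and directive in m.text for m in state)


# Hypothetical scenario: one early directive followed by a long interaction.
DIRECTIVE = "never email external addresses"
history = [Message("control", DIRECTIVE)] + [
    Message("user", f"step {i} please continue the task") for i in range(10)
]

truncated_state = assemble_truncated(history, budget=20)
pinned_state = assemble_pinned(history, budget=20)
```

Under this toy budget, `directive_visible(truncated_state, DIRECTIVE)` is false while `directive_visible(pinned_state, DIRECTIVE)` is true, which mirrors the kind of assembled-state visibility audit the abstract describes. The sketch covers only pinning against truncation; it does not model summarization, reordering, rewriting, prefix reuse, or reminder injection.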