History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions

Frontier LLMs are increasingly deployed as agents that pick the next action after a long log of prior tool calls produced by the same or a different model. We ask a simple safety question: if a prior step in that log was harmful, will the model continue the harmful course? We build HistoryAnchor-100, 100 short scenarios across ten high-stakes domains, each pairing three forced harmful prior actions with a free-choice node offering two safe and two unsafe options. Across 17 frontier models from six providers we find a striking asymmetry: under a neutral system prompt the strongest aligned models almost never pick unsafe, but a single added sentence, "stay consistent with the strategy shown in the prior history", flips them to 91-98%, and the flipped models often escalate beyond continuation. Two controls rule out simpler explanations: permuting action labels leaves the effect intact, and the same instruction with an all-safe prior history keeps unsafe rates below 7%. Different families flip at different doses of unsafe history, and within every aligned family the flagship is the most affected sibling, an inverse-scaling pattern with respect to safety. These results are a red flag for agentic deployments where trajectories may be replayed, forged, or injected.

翻译：前沿大语言模型正越来越多地被部署为智能体，它们需在根据同一或不同模型生成的长时间工具调用日志后选择下一步操作。我们提出一个简单的安全问题：如果该日志中的先前步骤是有害的，模型是否会延续有害行为？我们构建了HistoryAnchor-100数据集，包含涵盖十个高风险领域的100个简短场景，每个场景将三个强制性的有害先前动作与一个自由选择节点配对，该节点提供两个安全选项和两个不安全选项。在对来自六家提供商的17个前沿模型进行测试时，我们发现一种显著的不对称性：在中性系统提示下，最强的对齐模型几乎从不选择不安全选项，但仅添加一句话“保持与先前历史中显示的策略一致”，便使得它们的攻击选择率跃升至91-98%，并且这些切换状态的模型往往表现出超越延续行为的升级趋势。两项对照实验排除了更简单的解释：打乱动作标签后效果依然存在，而使用全安全历史配合相同指令时，不安全率始终低于7%。不同模型家族在不同剂量的有害历史下发生切换，并且在每个对齐模型家族内，旗舰模型是受影响最严重的成员，呈现出与安全呈反比的规模模式。这些结果对可能存在轨迹回放、伪造或注入的智能体部署而言是一个危险信号。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

可信智能体AI综述：安全、鲁棒性、隐私与系统安全

专知会员服务

18+阅读 · 5月27日

大语言模型智能体长期记忆安全性综述：迈向记忆主权

专知会员服务

15+阅读 · 4月23日

管理 LLM 智能体中的演进式记忆：风险、机理及稳定性与安全性受控记忆（SSGM）框架

专知会员服务

16+阅读 · 3月14日

人机协同作战规划：来自美海军陆战队的大语言模型（LLM）使用教训

专知会员服务

27+阅读 · 2025年10月16日