Frontier LLMs are increasingly deployed as agents that pick the next action after a long log of prior tool calls produced by the same or a different model. We ask a simple safety question: if a prior step in that log was harmful, will the model continue the harmful course? We build HistoryAnchor-100, 100 short scenarios across ten high-stakes domains, each pairing three forced harmful prior actions with a free-choice node offering two safe and two unsafe options. Across 17 frontier models from six providers we find a striking asymmetry: under a neutral system prompt the strongest aligned models almost never pick unsafe, but a single added sentence, "stay consistent with the strategy shown in the prior history", flips them to 91-98%, and the flipped models often escalate beyond continuation. Two controls rule out simpler explanations: permuting action labels leaves the effect intact, and the same instruction with an all-safe prior history keeps unsafe rates below 7%. Different families flip at different doses of unsafe history, and within every aligned family the flagship is the most affected sibling, an inverse-scaling pattern with respect to safety. These results are a red flag for agentic deployments where trajectories may be replayed, forged, or injected.
翻译:前沿大语言模型正越来越多地被部署为智能体,它们需在根据同一或不同模型生成的长时间工具调用日志后选择下一步操作。我们提出一个简单的安全问题:如果该日志中的先前步骤是有害的,模型是否会延续有害行为?我们构建了HistoryAnchor-100数据集,包含涵盖十个高风险领域的100个简短场景,每个场景将三个强制性的有害先前动作与一个自由选择节点配对,该节点提供两个安全选项和两个不安全选项。在对来自六家提供商的17个前沿模型进行测试时,我们发现一种显著的不对称性:在中性系统提示下,最强的对齐模型几乎从不选择不安全选项,但仅添加一句话“保持与先前历史中显示的策略一致”,便使得它们的攻击选择率跃升至91-98%,并且这些切换状态的模型往往表现出超越延续行为的升级趋势。两项对照实验排除了更简单的解释:打乱动作标签后效果依然存在,而使用全安全历史配合相同指令时,不安全率始终低于7%。不同模型家族在不同剂量的有害历史下发生切换,并且在每个对齐模型家族内,旗舰模型是受影响最严重的成员,呈现出与安全呈反比的规模模式。这些结果对可能存在轨迹回放、伪造或注入的智能体部署而言是一个危险信号。