The accelerating deployment of language models (LMs) as agents in long-context tasks motivates a thorough understanding of goal drift: an agent's tendency to deviate from its original objective. While prior-generation LM agents have been shown to be susceptible to drift, the extent to which drift affects more recent models remains unclear. In this work, we provide an updated characterization of the extent and causes of goal drift. We investigate drift in state-of-the-art models within a simulated stock-trading environment (Arike et al., 2025) and find that these models are largely robust, even when subjected to adversarial pressure. We show, however, that this robustness is brittle: across multiple settings, the same models often inherit drift when conditioned on prefilled trajectories from weaker agents. The extent of conditioning-induced drift varies significantly by model family, with only GPT-5.1 maintaining consistent resilience among the tested models. We also find that drift behavior is inconsistent across prompt variations and correlates poorly with instruction-hierarchy-following behavior: strong hierarchy following does not reliably predict resistance to drift. Finally, we run analogous experiments in a new emergency room triage environment and find preliminary evidence that our results transfer across qualitatively different settings. Our findings underscore the continued vulnerability of modern LM agents to contextual pressures and the need for refined post-training techniques to mitigate this vulnerability.