Persistent language-model agents increasingly combine tool use, tiered memory, reflective prompting, and runtime adaptation. In such systems, behavior is shaped not only by current prompts but by mutable internal conditions that influence future action. This paper introduces layered mutability, a framework for reasoning about that process across five layers: pretraining, post-training alignment, self-narrative, memory, and weight-level adaptation. The central claim is that governance difficulty rises when mutation is rapid, downstream coupling is strong, reversibility is weak, and observability is low, creating a systematic mismatch between the layers that most affect behavior and the layers humans can most easily inspect. I formalize this intuition with simple drift, governance-load, and hysteresis quantities, connect the framework to recent work on temporal identity in language-model agents, and report a preliminary ratchet experiment in which reverting an agent's visible self-description after memory accumulation fails to restore baseline behavior. In that experiment, the estimated identity hysteresis ratio is 0.68. The main implication is that the salient failure mode for persistent self-modifying agents is not abrupt misalignment but compositional drift: locally reasonable updates that accumulate into a behavioral trajectory that was never explicitly authorized.