Language model representations often contain linear directions that correspond to high-level concepts. Here, we study the dynamics of these representations: how they evolve along these dimensions over the course of (simulated) conversations. We find that linear representations can change dramatically over a conversation; for example, information that is represented as factual at the beginning of a conversation can be represented as non-factual at the end, and vice versa. These changes are content-dependent: while representations of conversation-relevant information may change, representations of generic information are generally preserved. The changes are robust even for dimensions that disentangle factuality from more superficial response patterns, and they occur across different model families and across layers within a model. They also do not require on-policy conversations; even replaying a conversation script written by an entirely different model can produce similar changes. However, adaptation is much weaker when the context simply contains a sci-fi story that is more explicitly framed as fiction. We also show that steering along a representational direction can have dramatically different effects at different points in a conversation. These results are consistent with the idea that representations may evolve in response to the model playing a particular role cued by the conversation. Our findings may pose challenges for interpretability and steering -- in particular, they imply that it may be misleading to rely on static interpretations of features or directions, or on probes that assume a particular range of feature values consistently corresponds to a particular ground-truth value. However, these representational dynamics also point to exciting new research directions for understanding how models adapt to context.
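The two operations the abstract refers to -- reading out a linear direction at different points in a conversation, and steering along that direction -- can be sketched as follows. This is a minimal illustration, not the paper's actual code: it assumes a HuggingFace-style decoder, and the model name, layer index, and direction vector `v` are placeholders rather than a trained probe.

```python
# Minimal sketch: (1) project each turn's last-token hidden state onto a fixed
# linear direction to track representational drift over a conversation, and
# (2) steer by adding that direction back into the residual stream via a hook.
# Model name, layer index, and `v` are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

layer = 16                                   # hypothetical probe layer
v = torch.randn(model.config.hidden_size)    # stand-in for a learned direction
v = v / v.norm()

def projection_per_turn(conversation):
    """Project the last-token hidden state after each turn onto v.

    `conversation` is a list of {"role": ..., "content": ...} messages.
    Drift in the returned values over turns is the kind of representational
    dynamics the abstract describes.
    """
    scores = []
    for end in range(1, len(conversation) + 1):
        ids = tok.apply_chat_template(conversation[:end], return_tensors="pt")
        with torch.no_grad():
            out = model(ids, output_hidden_states=True)
        h = out.hidden_states[layer][0, -1]  # last token at the chosen layer
        scores.append((h.float() @ v).item())
    return scores

def steer(alpha):
    """Register a hook that adds alpha * v to the residual stream at `layer`."""
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * v.to(hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    # remove with `handle.remove()` after generating
    return model.model.layers[layer].register_forward_hook(hook)
```

Under this setup, the abstract's observation that steering "can have dramatically different effects at different points in a conversation" amounts to applying `steer(alpha)` with the same `alpha` at different conversation prefixes and comparing the resulting generations.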