Large language models (LLMs) represent prompt-conditioned beliefs (posteriors over answers and claims), but we lack a mechanistic account of how these beliefs are encoded in representation space, how they update with new evidence, and how interventions reshape them. We study a controlled setting in which Llama-3.2 generates samples from a normal distribution by implicitly inferring its parameters (mean and standard deviation) from samples of the distribution given only in context. We find that curved "belief manifolds" representing these parameters form with sufficient in-context learning, and we study how the model adapts when the distribution suddenly changes. While standard linear steering often pushes the model off-manifold and induces coupled, out-of-distribution shifts, geometry- and field-aware steering better preserves the intended belief family. Our work demonstrates linear field probing (LFP), a simple approach to tiling the data manifold and making interventions that respect the underlying geometry. We conclude that rich structure emerges naturally in LLMs and that purely linear concept representations are often an inadequate abstraction.
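The controlled setting can be illustrated with a minimal sketch: prompt the model with a list of draws from a normal distribution and let it continue the sequence, so that plausible continuations require implicitly inferring the mean and standard deviation. The model checkpoint, prompt format, and sampling parameters below are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of the in-context setup (assumed model name and prompt format).
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # hypothetical choice of Llama-3.2 variant
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Draw in-context samples from a normal distribution with hidden parameters.
rng = np.random.default_rng(0)
mu, sigma = 3.0, 1.5
samples = rng.normal(mu, sigma, size=64)

# Present the samples as a comma-separated list; the model must implicitly
# infer (mu, sigma) to produce plausible continuations of the sequence.
prompt = ", ".join(f"{x:.2f}" for x in samples) + ", "
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20, do_sample=True, temperature=1.0)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```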