Large language models are increasingly used as behavioral simulators, but it remains unclear when their outputs reflect human-like cognitive mechanisms rather than prompt-sensitive surface patterns. We study this question through the realization effect, a well-characterized finding in behavioral economics in which risk-taking differs systematically after paper versus realized gains and losses. We evaluate LLM behavior at three levels: prompt-only behavioral sensitivity, linear readout of internal representations, and causal control via activation steering. Prompt-only results show systematic condition sensitivity, but the directional pattern does not reproduce human realization-effect predictions. Gemma's residual stream contains a linearly decodable realization-status signal at layer 18 that generalizes to held-out prompts. Steering along this direction does not, however, reliably shift downstream risk choices, a null result that holds across positive scales and in a negative sign-symmetry run. Behavioral sensitivity, latent readout, and causal control are three distinct properties that do not automatically co-occur, and successful latent readout is insufficient evidence that a model behaviorally relies on a representation during downstream decision-making.