Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

Anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports (SR) reliably predict behavior. Recent work documented substantial SR-behavior dissociation in LLMs, but relied on broad personality traits (Big 5) that predict specific behaviors only weakly, even in humans. Furthermore, isolated conversational sessions and weak context matching left open whether LLMs truly lack coherence or whether the conditions needed to detect it were not met. We contrast Big 5 with the Theory of Planned Behavior (TPB), which measures intention targeted at a specific behavior and predicts human behavior substantially better than broad traits. We run experiments across four behavioral tasks and 11 frontier LLMs, varying session context and identity induction. We find that SR-behavior coherence exists but is selective. 1) Within a shared conversation, TPB reaches human-level coherence; Big 5 does not. 2) Across separate conversations, coherence survives only for behaviors anchored outside the immediate prompt, such as implicit bias shaped by training, and collapses when behavior is strongly primed by context, as with sycophancy. 3) Persona prompting makes self-reports more consistent across conversations, but does not bring behavior into alignment with them. These findings suggest that coarse personality frameworks such as Big 5 may not be the best tools for testing deployment behavior. More task- and behavior-specific instruments are needed, and even these must be evaluated across tasks and contexts.
