When large language models (LLMs) are asked to perform certain tasks, how can we be sure that their learned representations align with reality? We propose a domain-agnostic framework for systematically evaluating distribution shifts in LLMs' decision-making processes, in which the models are given control of mechanisms governed by pre-defined rules. While individual LLM actions may appear consistent with expected behavior, statistically significant distribution shifts can emerge across a large number of trials. To test this, we construct a well-defined environment with known outcome logic: blackjack. In more than 1,000 trials, we uncover statistically significant evidence of behavioral misalignment in the learned representations of LLMs.
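As a concrete illustration of the kind of test the framework relies on, the sketch below runs a Pearson chi-squared goodness-of-fit test comparing an LLM's observed action counts in a fixed blackjack state against the distribution expected under basic strategy. All counts and probabilities here are invented for illustration; the paper's actual trial data and test procedure may differ.

```python
# Hypothetical sketch: chi-squared goodness-of-fit test of an LLM's action
# distribution in one blackjack state versus basic-strategy expectations.
# Every number below is an invented placeholder, not data from the paper.

def chi_squared_stat(observed, expected):
    """Pearson's chi-squared statistic over matched category counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Actions: [hit, stand, double] for a single game state over 1,000 trials.
observed = [620, 310, 70]            # invented LLM action counts
expected_probs = [0.95, 0.04, 0.01]  # invented basic-strategy distribution
expected = [p * 1000 for p in expected_probs]

stat = chi_squared_stat(observed, expected)
# Critical value for df=2 at alpha=0.001 is roughly 13.82; exceeding it
# indicates a statistically significant shift away from expected behavior.
print(f"chi2 = {stat:.2f}, significant shift: {stat > 13.82}")
```

Aggregating such per-state tests over many trials is what lets seemingly reasonable individual actions reveal a systematic distribution shift.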