There is increasing interest in using LLMs as decision-making "agents." Doing so involves many degrees of freedom: which model should be used; how should it be prompted; should it be asked to introspect, conduct chain-of-thought reasoning, etc.? Settling these questions (and, more broadly, determining whether an LLM agent is reliable enough to be trusted) requires a methodology for assessing such an agent's economic rationality. In this paper, we provide one. We begin by surveying the economic literature on rational decision making, taxonomizing a large set of fine-grained "elements" that an agent should exhibit, along with the dependencies between them. We then propose a benchmark distribution that quantitatively scores an LLM's performance on these elements and, combined with a user-provided rubric, produces a "STEER report card." Finally, we describe the results of a large-scale empirical experiment with 14 different LLMs, characterizing both the current state of the art and the impact of different model sizes on models' ability to exhibit rational behavior.