There is increasing interest in using LLMs as decision-making "agents." Doing so involves many degrees of freedom: which model should be used; how should it be prompted; should it be asked to introspect, conduct chain-of-thought reasoning, etc.? Settling these questions (and, more broadly, determining whether an LLM agent is reliable enough to be trusted) requires a methodology for assessing such an agent's economic rationality. In this paper, we provide one. We begin by surveying the economic literature on rational decision making, taxonomizing a large set of fine-grained "elements" that an agent should exhibit, along with dependencies between them. We then propose a benchmark distribution that quantitatively scores an LLM's performance on these elements and, combined with a user-provided rubric, produces a "rationality report card." Finally, we describe the results of a large-scale empirical experiment with 14 different LLMs, characterizing both the current state of the art and the impact of different model sizes on models' ability to exhibit rational behavior.