State-of-the-art large language models (LLMs) have shown impressive performance on a variety of benchmark tasks and are increasingly used as components in larger applications, where LLM-based predictions serve as proxies for human judgements or decisions. This raises questions about the human-likeness of LLM-derived information, its alignment with human intuition, and whether LLMs could possibly be considered (parts of) explanatory models of (aspects of) human cognition or language use. To shed more light on these issues, we investigate the human-likeness of LLMs' predictions for multiple-choice decision tasks from the perspective of Bayesian statistical modeling. Using human data from a forced-choice experiment on pragmatic language use, we find that LLMs do not capture the variance in the human data at the item level. We suggest different ways of deriving full distributional predictions from LLMs for aggregate, condition-level data, and find that some, but not all, ways of obtaining condition-level predictions yield adequate fits to human data. These results suggest that assessments of LLM performance depend strongly on seemingly subtle methodological choices, and that LLMs are at best predictors of human behavior at the aggregate, condition level, a level at which they are, however, neither designed nor typically used to make predictions in the first place.
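To make the notions of item-level and condition-level predictions concrete, the following is a minimal illustrative sketch, not the paper's actual procedure: it assumes hypothetical per-option log-probabilities (placeholder values in `item_logprobs`, e.g. as might be read off an LLM's next-token scores for the answer-option labels), renormalizes them into item-level choice distributions, and shows one possible way of aggregating these into a condition-level prediction.

```python
import numpy as np

# Hypothetical per-item log-probabilities for each answer option,
# e.g. an LLM's next-token scores for the option labels.
# Shape: (n_items, n_options); the values below are placeholders.
item_logprobs = np.array([
    [-0.3, -1.5, -2.2],
    [-0.9, -0.7, -2.0],
    [-0.2, -1.9, -2.5],
])

# Item-level predictions: renormalize over the answer options,
# since the option tokens do not exhaust the LLM's vocabulary.
item_probs = np.exp(item_logprobs)
item_probs /= item_probs.sum(axis=1, keepdims=True)

# One possible condition-level prediction: average the item-level
# distributions across all items belonging to the same condition.
condition_probs = item_probs.mean(axis=0)
print(condition_probs)
```

Averaging item-level distributions is only one of several aggregation schemes one might consider; as the abstract notes, such seemingly subtle methodological choices can substantially affect how well the resulting predictions fit human data.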