Large language models (LLMs) can generate diverse high-level phenomena without explicitly programmed rules. This capability has led to their adoption in agent-based models (ABMs) and social simulations. Recent studies investigate their ability to generate phenomena of interest, for example, human behavior on social media platforms or alien behavior in game-theoretic scenarios. However, capability, prediction, and explanation are distinct: according to the philosophy of science and mechanisms literature, explanation requires showing, to some degree, how a phenomenon is produced by related organized entities and activities. For modelers, describing the characteristics of an experiment, or judging whether a simulation advances capability or explanation, can be difficult without grounding in potentially distant research areas. We integrate recent work on LLM-ABMs with contemporary philosophy of science literature and use it to operationalize a definition of 'plausibility' as a four-level scale. Our scale separates the evaluation of a model's generative sufficiency (its ability to reproduce a phenomenon) from its mechanistic plausibility (how the phenomenon could be produced), and clarifies the distinct roles of different kinds of models, such as predictive and explanatory ones. We introduce this as the Mechanism Plausibility Scale.