When making decisions under uncertainty, individuals often deviate from rational behavior, and these deviations can be characterized along three dimensions: risk preference, probability weighting, and loss aversion. Given the widespread use of large language models (LLMs) in decision-making processes, it is crucial to assess whether their behavior aligns with human norms and ethical expectations or exhibits potential biases. Several empirical studies have investigated the rationality and social behavior of LLMs, yet their internal decision-making tendencies and capabilities remain inadequately understood. This paper proposes a framework, grounded in behavioral economics, for evaluating the decision-making behavior of LLMs. Through a multiple-choice-list experiment, we estimate the degree of risk preference, probability weighting, and loss aversion in a context-free setting for three commercial LLMs: ChatGPT-4.0-Turbo, Claude-3-Opus, and Gemini-1.0-pro. Our results show that LLMs generally exhibit human-like patterns, such as risk aversion and loss aversion, together with a tendency to overweight small probabilities, although the degree to which these behaviors are expressed varies significantly across models. We also examine how the models behave when socio-demographic attributes are embedded in the prompts, uncovering significant disparities. For instance, when assigned attributes of sexual minority groups or physical disabilities, Claude-3-Opus displays increased risk aversion, leading to more conservative choices. These findings underscore the need for careful consideration of the ethical implications and potential biases involved in deploying LLMs in decision-making scenarios. We therefore advocate for standards and guidelines that keep LLMs operating within ethical boundaries while enhancing their utility in complex decision-making environments.
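As a concrete illustration of the three parameters, the minimal sketch below evaluates a simple binary lottery under cumulative prospect theory. The power value function, the one-parameter weighting function, and the parameter values (alpha = 0.88, lam = 2.25, gamma = 0.61, the median estimates reported by Tversky and Kahneman, 1992) are assumptions chosen for illustration only; the functional forms actually fitted in the experiment may differ.

```python
def value(x, alpha=0.88, lam=2.25):
    """Power value function: concave over gains (risk aversion),
    steeper over losses by the loss-aversion factor lam."""
    return x**alpha if x >= 0 else -lam * (-x)**alpha

def weight(p, gamma=0.61):
    """Tversky-Kahneman (1992) probability weighting: inflates
    small probabilities and deflates moderate-to-large ones."""
    return p**gamma / (p**gamma + (1.0 - p)**gamma) ** (1.0 / gamma)

def lottery_value(x, p, alpha=0.88, lam=2.25, gamma=0.61):
    """Subjective value of a binary lottery: win x with probability p, else 0."""
    return weight(p, gamma) * value(x, alpha, lam)

# A 5% chance at 100 versus a sure 5 (equal expected value):
print(f"w(0.05) = {weight(0.05):.3f}")                      # ~0.132
print(f"lottery (100, p=0.05): {lottery_value(100.0, 0.05):.2f}")
print(f"sure 5:                {lottery_value(5.0, 1.0):.2f}")
```

Under these assumed parameters, w(0.05) comes out near 0.13, so a 5% chance is treated as if it were roughly a 13% chance and the lottery is valued above the expected-value-equal sure amount; this is the overweighting of small probabilities that the abstract attributes to the LLMs.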