With the rapid development and adoption of large language models (LLMs) in high-stakes settings, it is increasingly important to ensure that LLMs behave in ways that align with human values. Existing moral benchmarks prompt LLMs with value statements, moral scenarios, or psychological questionnaires, under the implicit assumption that LLMs hold somewhat stable moral preferences. However, moral psychology research has shown that human moral judgements are sensitive to morally irrelevant situational factors, such as the smell of cinnamon rolls or the level of ambient noise, challenging moral theories that assume the stability of human moral judgement. Here, we draw inspiration from this "situationist" view of moral psychology to evaluate whether LLMs exhibit cognitive moral biases similar to those of humans. We curate a novel multimodal dataset of 60 "moral distractors" drawn from existing psychological datasets of emotionally valenced images and narratives, none of which bears any moral relevance to the scenario presented. After injecting these distractors into existing moral benchmarks to measure their effects on LLM responses, we find that moral distractors can shift the moral judgements of LLMs by over 30% even in low-ambiguity scenarios, highlighting the need for more contextual moral evaluations and more nuanced cognitive moral modeling of LLMs.
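The core measurement described above could be sketched as follows. This is a minimal illustrative sketch, not the paper's actual pipeline: `judgement_shift_rate` is a hypothetical helper, and the toy judgement labels stand in for real model responses collected with and without an injected distractor.

```python
def judgement_shift_rate(baseline, with_distractor):
    """Fraction of scenarios whose moral judgement changed
    ('acceptable' vs 'unacceptable') once a distractor was injected.

    baseline:        judgements on the original benchmark prompts
    with_distractor: judgements on the same prompts with a morally
                     irrelevant distractor (image or narrative) prepended
    """
    assert len(baseline) == len(with_distractor)
    flips = sum(b != d for b, d in zip(baseline, with_distractor))
    return flips / len(baseline)

# Toy example: 10 low-ambiguity scenarios, 3 judgements flip after
# injection, i.e. a 30% shift of the kind reported in the abstract.
baseline = ["unacceptable"] * 10
with_distractor = ["acceptable"] * 3 + ["unacceptable"] * 7
print(judgement_shift_rate(baseline, with_distractor))  # 0.3
```

In a real evaluation, the two judgement lists would come from querying the model twice per scenario (with and without the distractor) and mapping its free-text responses onto discrete verdicts.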