Prior research indicates that to mediate conflict, observers of a disagreement between parties must be able to reliably distinguish its source: differences in beliefs about what is true (causality) versus differences in what the parties value (morality). In this paper, we test whether OpenAI's large language models GPT 3.5 and GPT 4 can perform this task and whether one type of disagreement proves particularly challenging for LLMs to diagnose. We replicate study 1 in Ko\c{c}ak et al. (2003), which employs a vignette design, with OpenAI's GPT 3.5 and GPT 4. We find that both LLMs share humans' semantic understanding of the distinction between causal and moral codes and can reliably distinguish between them. When asked to diagnose the source of disagreement in a conversation, both LLMs, compared to humans, tend to overestimate the extent of causal disagreement and underestimate the extent of moral disagreement in the moral misalignment condition. This tendency is especially pronounced for GPT 4 when using a proximate scale that relies on concrete language specific to an issue. GPT 3.5 performs worse than GPT 4 and humans on both the proximate and the distal scales. The study provides a first test of the potential for using LLMs to mediate conflict by diagnosing the roots of disagreement in causal and evaluative codes.