We explore how large language models (LLMs) can be influenced by prompting them to alter their initial decisions and align them with established ethical frameworks. Our study is based on two experiments designed to assess the susceptibility of LLMs to moral persuasion. In the first experiment, we examine susceptibility to persuasion in morally ambiguous settings by evaluating a Base Agent LLM on morally ambiguous scenarios and observing how a Persuader Agent attempts to modify the Base Agent's initial decisions. The second experiment evaluates the susceptibility of LLMs to align with predefined ethical frameworks by prompting them to adopt specific value alignments rooted in established philosophical theories. The results demonstrate that LLMs can indeed be persuaded in morally charged scenarios, with the success of persuasion depending on factors such as the model used, the complexity of the scenario, and the conversation length. Notably, LLMs of distinct sizes but from the same company produced markedly different outcomes, highlighting the variability in their susceptibility to ethical persuasion.