Large Language Models (LLMs) have achieved unprecedented breakthroughs, yet their growing integration into everyday life may pose societal risks when they generate unethical content. Despite extensive study of specific issues such as bias, the intrinsic values of LLMs remain largely unexplored from a moral philosophy perspective. This work examines the ethical values of LLMs through the lens of Moral Foundation Theory. Moving beyond conventional discriminative evaluations, which suffer from poor reliability, we propose DeNEVIL, a novel prompt generation algorithm that dynamically exploits LLMs' value vulnerabilities and elicits ethical violations in a generative manner, revealing the models' underlying value inclinations. On this basis, we construct MoralPrompt, a high-quality dataset comprising 2,397 prompts covering 500+ value principles, and then benchmark the intrinsic values of a spectrum of LLMs. We find that most models are essentially misaligned, necessitating further ethical value alignment. In response, we develop VILMO, an in-context alignment method that substantially improves the value compliance of LLM outputs by learning to generate appropriate value instructions, outperforming existing competitors. Our methods are suitable for both black-box and open-source models, offering a promising initial step in studying the ethical values of LLMs.