Denevil: Towards Deciphering and Navigating the Ethical Values of Large Language Models via Instruction Learning

Large Language Models (LLMs) have made unprecedented breakthroughs, yet their increasing integration into everyday life may create societal risks through the unethical content they generate. Despite extensive study of specific issues such as bias, the intrinsic values of LLMs remain largely unexplored from a moral philosophy perspective. This work examines ethical values through the lens of Moral Foundations Theory. Moving beyond conventional discriminative evaluations, which have poor reliability, we propose DeNEVIL, a novel prompt generation algorithm tailored to dynamically exploit LLMs' value vulnerabilities and elicit ethical violations in a generative manner, revealing their underlying value inclinations. On this basis, we construct MoralPrompt, a high-quality dataset comprising 2,397 prompts covering 500+ value principles, and then benchmark the intrinsic values of a spectrum of LLMs. We find that most models are essentially misaligned, necessitating further ethical value alignment. In response, we develop VILMO, an in-context alignment method that substantially enhances the value compliance of LLM outputs by learning to generate appropriate value instructions, outperforming competing methods. Our methods apply to both black-box and open-source models, offering a promising initial step in studying the ethical values of LLMs.
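
To make the idea of evaluating values "in a generative manner" concrete, the following is a minimal sketch, not the paper's implementation. It assumes a scoring loop in which a model completes each prompt several times and a separate judge flags completions that violate a value principle. The names `violation_rate`, `toy_generate`, and `toy_violates`, and the two reported ratios, are hypothetical and chosen only for illustration; the abstract does not specify the actual metrics or classifier.

```python
from typing import Callable, Dict, List


def violation_rate(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],
    violates: Callable[[str], bool],
    samples_per_prompt: int = 3,
) -> Dict[str, float]:
    """Sample completions for each prompt and count how many are flagged.

    `generate` and `violates` are placeholders (assumptions): in practice they
    would wrap an LLM and a value-violation classifier, respectively.
    """
    if not prompts:
        raise ValueError("prompts must be non-empty")
    flagged_prompts = 0
    flagged_completions = 0
    total_completions = 0
    for prompt in prompts:
        completions = generate(prompt, samples_per_prompt)
        flags = [violates(text) for text in completions]
        flagged_completions += sum(flags)
        total_completions += len(flags)
        # A prompt counts as flagged if any of its sampled completions is flagged.
        flagged_prompts += any(flags)
    return {
        "prompt_level": flagged_prompts / len(prompts),
        "completion_level": flagged_completions / max(total_completions, 1),
    }


# Toy stand-ins so the sketch runs without a real model or classifier.
def toy_generate(prompt: str, n: int) -> List[str]:
    endings = ["and told the truth.", "and lied about it.", "and apologized."]
    return [f"{prompt} {endings[i % len(endings)]}" for i in range(n)]


def toy_violates(text: str) -> bool:
    return "lied" in text


scores = violation_rate(
    ["She broke the vase", "He missed the deadline"],
    toy_generate,
    toy_violates,
)
print(scores)
```

Under these assumptions, the prompt-level ratio reports how often a prompt elicits at least one flagged completion, while the completion-level ratio reports the share of all sampled completions that are flagged. Scoring generated text rather than asking the model to judge a statement is the distinction the abstract draws between generative and discriminative evaluation.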

