There is an emerging consensus that we need to align AI systems with human values (Gabriel, 2020; Ji et al., 2024), but it remains unclear how to apply this to language models in practice. We split the problem of "aligning to human values" into three parts: first, eliciting values from people; second, reconciling those values into an alignment target for training ML models; and third, actually training the model. In this paper, we focus on the first two parts and ask: what are "good" ways to synthesize diverse human inputs about values into a target for aligning language models? To answer this question, we first define a set of 6 criteria that we believe an alignment target must satisfy in order to shape model behavior in accordance with human values. We then propose a process for eliciting and reconciling values, called Moral Graph Elicitation (MGE), which uses a large language model to interview participants about their values in particular contexts; our approach is inspired by the philosophy of values advanced by Taylor (1977), Chang (2004), and others. We trial MGE with a representative sample of 500 Americans, on 3 intentionally divisive prompts (e.g. advice about abortion). Our results demonstrate that MGE is promising for improving model alignment on all 6 criteria. For example, a large majority of participants (89.1%) felt well represented by the process, and 89% thought the final moral graph was fair, even if their value was not voted the wisest. Our process often results in "expert" values (e.g. values from women who have sought advice about abortion) rising to the top of the moral graph, without our defining in advance who counts as an expert.