A crucial consideration when developing and deploying Large Language Models (LLMs) is the set of human values to which these models are aligned. In the constitutional framework of alignment, models are aligned to a set of principles (the constitution) specified in natural language. However, it is unclear how to fairly determine this constitution with widespread stakeholder input. In this work we propose Grounded Constitutional AI (GCAI), a unified framework for generating constitutions of principles that represent both users' general expectations toward AI (general principles) and their interaction-time preferences (contextual principles). We extend the Inverse Constitutional AI (ICAI) approach to generate contextual principles from human preference annotation data by leveraging human-provided \textit{reasons} for their preferences. We supplement these contextual principles with general principles surfaced from user statements of \textit{values} regarding AI. We show that a constitution generated by GCAI is preferred by humans over one generated through ICAI, both personally and for widespread use in governing AI behavior. Additionally, participants consider the GCAI constitution to be more morally grounded, coherent, and pluralistic.