With the increased power and prevalence of AI systems, it is ever more critical that they are designed to serve everyone, i.e., people with diverse values and perspectives. However, aligning models to serve pluralistic human values remains an open research question. In this piece, we propose a roadmap to pluralistic alignment, using language models as a test bed. We identify and formalize three possible ways to define and operationalize pluralism in AI systems: 1) Overton pluralistic models that present a spectrum of reasonable responses; 2) Steerably pluralistic models that can be steered to reflect certain perspectives; and 3) Distributionally pluralistic models that are well-calibrated to a given population in distribution. We also propose and formalize three possible classes of pluralistic benchmarks: 1) Multi-objective benchmarks; 2) Trade-off steerable benchmarks, which incentivize models to steer to arbitrary trade-offs; and 3) Jury-pluralistic benchmarks, which explicitly model diverse human ratings. We use this framework to argue that current alignment techniques may be fundamentally limited for pluralistic AI; indeed, we highlight empirical evidence, both from our own experiments and from other work, that standard alignment procedures might reduce distributional pluralism in models, motivating the need for further research on pluralistic alignment.
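To make the notion of distributional pluralism concrete, one way to quantify how well a model's response distribution matches a population's is a divergence between the two discrete distributions. The sketch below is illustrative only, not the paper's metric: the survey fractions, the model's answer frequencies, and the choice of Jensen-Shannon divergence are all assumptions for the example.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    Uses base-2 logarithms, so the result lies in [0, 1]:
    0 means identical distributions, 1 means maximally different.
    """
    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    # Mixture distribution m = (p + q) / 2.
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical example: a contested question with three answer options.
human_dist = [0.5, 0.3, 0.2]    # assumed survey fractions in the population
model_dist = [0.9, 0.05, 0.05]  # assumed frequencies of the model's sampled answers

# A large divergence indicates the model collapses onto one answer
# rather than reflecting the population's spread of views.
print(round(js_divergence(human_dist, model_dist), 3))
```

Under this toy metric, a distributionally pluralistic model would drive the divergence toward zero, whereas a model sharpened by standard alignment onto a single majority answer would push it up.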