Foundation models such as GPT-4 are fine-tuned to avoid unsafe or otherwise problematic behavior, such as helping to commit crimes or producing racist text. One approach to fine-tuning, called reinforcement learning from human feedback, learns from humans' expressed preferences over multiple outputs. Another approach is constitutional AI, in which the input from humans is a list of high-level principles. But how do we deal with potentially diverging input from humans? How can we aggregate the input into consistent data about "collective" preferences or otherwise use it to make collective choices about model behavior? In this paper, we argue that the field of social choice is well positioned to address these questions, and we discuss ways forward for this agenda, drawing on discussions in a recent workshop on Social Choice for AI Ethics and Safety held in Berkeley, CA, USA in December 2023.
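To make concrete why aggregating diverging human input is nontrivial, the following minimal sketch (illustrative only; the annotators, outputs, and rankings are hypothetical, not taken from the paper) reproduces the classic Condorcet paradox from social choice theory: three annotators rank three model outputs, and pairwise majority voting yields a cycle, so no consistent collective ranking exists.

```python
# Hypothetical example: three annotators rank three model outputs A, B, C.
# Pairwise majority voting over these rankings produces a cycle (the
# Condorcet paradox), showing that "collective" preferences need not be
# consistent even when every individual's preferences are.

from itertools import combinations

# Each ranking lists outputs from most to least preferred (hypothetical data).
rankings = [
    ["A", "B", "C"],  # annotator 1
    ["B", "C", "A"],  # annotator 2
    ["C", "A", "B"],  # annotator 3
]

def majority_prefers(x, y, rankings):
    """True if a strict majority of annotators rank x above y."""
    wins = sum(r.index(x) < r.index(y) for r in rankings)
    return wins > len(rankings) / 2

for x, y in combinations(["A", "B", "C"], 2):
    winner, loser = (x, y) if majority_prefers(x, y, rankings) else (y, x)
    print(f"majority prefers {winner} over {loser}")
# Prints: A over B, B over C, C over A -- a cycle, so pairwise majority
# yields no consistent collective ranking of the three outputs.
```

Social choice theory studies precisely which aggregation rules avoid or manage such inconsistencies, which is why the paper argues it is well positioned to inform how diverging human feedback should shape model behavior.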