Foundation models such as GPT-4 are fine-tuned to avoid unsafe or otherwise problematic behavior, so that, for example, they refuse to comply with requests for help with committing crimes or with producing racist text. One approach to fine-tuning, called reinforcement learning from human feedback, learns from humans' expressed preferences over multiple outputs. Another approach is constitutional AI, in which the input from humans is a list of high-level principles. But how do we deal with potentially diverging input from humans? How can we aggregate the input into consistent data about "collective" preferences or otherwise use it to make collective choices about model behavior? In this paper, we argue that the field of social choice is well positioned to address these questions, and we discuss ways forward for this agenda, drawing on discussions in a recent workshop on Social Choice for AI Ethics and Safety held in Berkeley, CA, USA, in December 2023.