Perspectives on the Social Impacts of Reinforcement Learning with Human Feedback

Is it possible for machines to think like humans? And if it is, how should we go about teaching them to do so? As early as 1950, Alan Turing suggested that we ought to teach machines the way we teach a child. Reinforcement learning with human feedback (RLHF) has emerged as a strong candidate for allowing agents to learn from human feedback in a naturalistic manner. RLHF is distinct from traditional reinforcement learning in that the agent receives feedback from a human teacher in addition to a reward signal. It has been catapulted into public view by multiple high-profile AI applications, including OpenAI's ChatGPT, DeepMind's Sparrow, and Anthropic's Claude. These highly capable chatbots are already overturning our understanding of how AI interacts with humanity. The wide applicability and burgeoning success of RLHF strongly motivate an evaluation of its social impacts. In light of recent developments, this paper considers an important question: can RLHF be developed and used without negatively affecting human societies? Our objectives are threefold: to provide a systematic study of the social effects of RLHF; to identify key social and ethical issues of RLHF; and to discuss social impacts for stakeholders. Although text-based applications of RLHF have received much attention, it is crucial, when evaluating its social implications, to consider the diverse range of areas in which it may be deployed. We describe seven primary ways in which RLHF-based technologies will affect society by positively transforming human experiences with AI. This paper ultimately proposes that RLHF has the potential to have a net positive impact in the areas of misinformation, AI value alignment, bias, AI access, cross-cultural dialogue, industry, and the workforce. As RLHF raises concerns that echo those of existing AI technologies, it will be important for all to be aware and intentional in the adoption of RLHF.
