Building systems that are good for society in the face of complex societal effects requires a dynamic approach. Recent approaches to machine learning (ML) documentation have demonstrated the promise of discursive frameworks for deliberation about these complexities. However, these developments have been grounded in a static ML paradigm, leaving the role of feedback and post-deployment performance unexamined. Meanwhile, recent work in reinforcement learning has shown that the effects of feedback and optimization objectives on system behavior can be wide-ranging and unpredictable. In this paper, we sketch a framework for documenting deployed and iteratively updated learning systems, which we call Reward Reports. Taking inspiration from various contributions to the technical literature on reinforcement learning, we outline Reward Reports as living documents that track updates to design choices and assumptions behind what a particular automated system is optimizing for. They are intended to track dynamic phenomena arising from system deployment, rather than merely static properties of models or data. After presenting the elements of a Reward Report, we discuss a concrete example: Meta's BlenderBot 3 chatbot. Several other examples, covering game playing (DeepMind's MuZero), content recommendation (MovieLens), and traffic control (Project Flow), are included in the appendix.