Training a deep neural network to maximize a target objective has become the standard recipe for successful machine learning over the last decade. These networks can be optimized with supervised learning if the target objective is differentiable. For many interesting problems, however, this is not the case. Common objectives such as intersection over union (IoU), bilingual evaluation understudy (BLEU) scores, or rewards cannot be optimized directly with supervised learning. A common workaround is to define differentiable surrogate losses, which leads to solutions that are suboptimal with respect to the actual objective. In recent years, reinforcement learning (RL) has emerged as a promising alternative for optimizing deep neural networks to maximize non-differentiable objectives. Examples include aligning large language models via human feedback, code generation, object detection, and control problems. This makes RL techniques relevant to the broader machine learning audience. The subject is, however, time-intensive to approach because of the large range of methods and their often very theoretical presentation. In this introduction, we take an approach that differs from classic reinforcement learning textbooks. Rather than focusing on tabular problems, we introduce reinforcement learning as a generalization of supervised learning, which we first apply to non-differentiable objectives and later to temporal problems. Assuming only basic knowledge of supervised learning, the reader will be able to understand state-of-the-art deep RL algorithms such as proximal policy optimization (PPO) after reading this tutorial.