In Reinforcement Learning (RL), the goal of an agent is to discover an optimal policy that maximizes the expected cumulative reward. This objective may also be viewed as finding a policy that optimizes a linear function of its state-action occupancy measure, hereafter referred to as Linear RL. However, many supervised and unsupervised RL problems are not covered by the Linear RL framework, such as apprenticeship learning, pure exploration, and variational intrinsic control, where the objectives are non-linear functions of the occupancy measure. RL with non-linear utilities looks unwieldy, because methods such as the Bellman equation, value iteration, policy gradient, and dynamic programming, which have had tremendous success in Linear RL, do not generalize trivially. In this paper, we derive the policy gradient theorem for RL with general utilities. The policy gradient theorem is a cornerstone of Linear RL because of its elegance and ease of implementation, and our policy gradient theorem for RL with general utilities shares both properties. Based on the derived theorem, we also present a simple sample-based algorithm. We believe our results will be of interest to the community and offer inspiration for future work in this generalized setting.
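To make the distinction between Linear RL and general utilities concrete, the following is a minimal sketch in LaTeX. The notation is an assumption introduced here for illustration and is not taken from the paper: a finite state-action space, a discount factor $\gamma$, an unnormalized discounted occupancy measure $\lambda^{\pi}$, a reward vector $r$, and a differentiable utility $F$. The last line is only the ordinary chain rule and should not be read as the theorem derived in the paper.

```latex
% Assumed notation (illustrative, not from the paper):
% discounted state-action occupancy measure of a policy \pi
\lambda^{\pi}(s,a) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\,
    \mathbf{1}\{s_t = s,\ a_t = a\}\right]

% Linear RL: the expected cumulative reward is linear in \lambda^{\pi}
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t,a_t)\right]
       \;=\; \sum_{s,a} \lambda^{\pi}(s,a)\, r(s,a)
       \;=\; \langle \lambda^{\pi},\, r \rangle

% RL with general utilities: F may be non-linear in \lambda^{\pi}
\max_{\pi}\; F\!\left(\lambda^{\pi}\right)

% Chain rule for a parameterized policy \pi_{\theta}, assuming F is differentiable
\nabla_{\theta}\, F\!\left(\lambda^{\pi_{\theta}}\right)
  \;=\; \sum_{s,a}
        \frac{\partial F}{\partial \lambda(s,a)}\!\left(\lambda^{\pi_{\theta}}\right)
        \nabla_{\theta}\, \lambda^{\pi_{\theta}}(s,a)
```

Choosing $F(\lambda) = \langle \lambda, r \rangle$ recovers the Linear RL objective, since the partial derivative of $F$ is then simply $r(s,a)$. A non-linear choice, such as an entropy of the normalized occupancy measure for pure exploration, makes that partial derivative depend on the policy itself, which is one way to see why the standard tools do not carry over directly.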