Maximising a cumulative reward function that is Markov and stationary, i.e., defined over state-action pairs and independent of time, is sufficient to capture many kinds of goals in a Markov decision process (MDP). However, not all goals can be captured in this manner. In this paper we study convex MDPs in which goals are expressed as convex functions of the stationary distribution and show that they cannot be formulated using stationary reward functions. Convex MDPs generalize the standard reinforcement learning (RL) problem formulation to a larger framework that includes many supervised and unsupervised RL problems, such as apprenticeship learning, constrained MDPs, and so-called "pure exploration". Our approach is to reformulate the convex MDP problem as a min-max game involving policy and cost (negative reward) "players", using Fenchel duality. We propose a meta-algorithm for solving this problem and show that it unifies many existing algorithms in the literature.
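The min-max reformulation can be sketched as follows. The notation is an assumption introduced here for illustration and is not taken from the abstract: $d$ denotes a stationary state-action distribution, $\mathcal{K}$ the set of distributions achievable by some policy, $f$ a closed, proper convex objective, $f^{*}$ its Fenchel conjugate, and $\lambda$ the cost vector chosen by the cost player.

```latex
% Convex MDP objective (assumed notation): minimise a convex function
% of the stationary state-action distribution d over the feasible set K.
\min_{d \in \mathcal{K}} f(d)

% Fenchel conjugate of f, and the biconjugate identity that holds
% when f is closed, proper and convex:
f^{*}(\lambda) = \sup_{d} \bigl( \lambda \cdot d - f(d) \bigr),
\qquad
f(d) = \sup_{\lambda} \bigl( \lambda \cdot d - f^{*}(\lambda) \bigr)

% Substituting the identity into the objective gives a min-max game
% between a policy player (choosing d) and a cost player (choosing \lambda):
\min_{d \in \mathcal{K}} f(d)
  = \min_{d \in \mathcal{K}} \, \sup_{\lambda}
    \bigl( \lambda \cdot d - f^{*}(\lambda) \bigr)
```

Under these assumptions, for a fixed $\lambda$ the policy player minimises the linear term $\lambda \cdot d$, which is a standard RL problem with stationary reward $-\lambda$. The standard RL formulation itself corresponds to the linear case $f(d) = -r \cdot d$ for a reward vector $r$.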