The classic Reinforcement Learning (RL) formulation concerns the maximization of a scalar reward function. More recently, convex RL has been introduced to extend the RL formulation to all objectives that are convex functions of the state distribution induced by a policy. Notably, convex RL covers several relevant applications that do not fall into the scalar formulation, including imitation learning, risk-averse RL, and pure exploration. In classic RL, it is common to optimize an infinite-trials objective, which is defined on the state distribution rather than on the empirical state visitation frequencies, even though the actual number of trajectories is always finite in practice. This is theoretically sound, since the infinite-trials and finite-trials objectives can be proved to coincide and thus lead to the same optimal policy. In this paper, we show that this hidden assumption does not hold in the convex RL setting. In particular, we show that erroneously optimizing the infinite-trials objective in place of the actual finite-trials one, as is usually done, can lead to a significant approximation error. Since the finite-trials setting is the default in both simulated and real-world RL, we believe that shedding light on this issue will lead to better approaches and methodologies for convex RL, impacting relevant research areas such as imitation learning, risk-averse RL, and pure exploration, among others.
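The gap between the two objectives can be illustrated with a toy computation. The sketch below is not the paper's construction: it assumes a hypothetical two-state problem in which each trajectory is a single step landing in state 0 with probability `p`, takes the negative entropy as the convex objective (to be minimized, as in pure exploration), and uses illustrative function names (`infinite_trials`, `finite_trials`) that do not come from the paper. The finite-trials value is the exact expectation of the objective over the empirical distribution of `n` trajectories, while the infinite-trials value evaluates the objective at the expected distribution.

```python
import math


def neg_entropy(d):
    """Convex objective F(d) = sum_s d(s) log d(s), with 0 log 0 taken as 0."""
    return sum(x * math.log(x) for x in d if x > 0)


def linear(d, reward=(1.0, 0.0)):
    """Classic RL objective: linear in the state distribution (hypothetical reward)."""
    return sum(r * x for r, x in zip(reward, d))


def infinite_trials(F, p):
    """F evaluated at the expected state distribution d = (p, 1 - p)."""
    return F((p, 1.0 - p))


def finite_trials(F, p, n):
    """Exact E[F(d_n)], with d_n the empirical distribution of n i.i.d. one-step trajectories."""
    total = 0.0
    for k in range(n + 1):  # k = number of trajectories ending in state 0
        prob = math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        total += prob * F((k / n, 1.0 - k / n))
    return total


if __name__ == "__main__":
    p = 0.5
    for n in (1, 2, 10, 100):
        gap_convex = finite_trials(neg_entropy, p, n) - infinite_trials(neg_entropy, p)
        gap_linear = finite_trials(linear, p, n) - infinite_trials(linear, p)
        print(f"n={n:4d}  convex gap={gap_convex:.4f}  linear gap={gap_linear:.4f}")
```

Under these assumptions, the linear objective gives the same value in both settings up to floating-point error, whereas the convex one does not: by Jensen's inequality the finite-trials value is never below the infinite-trials one, and the gap shrinks as `n` grows. With a single trajectory (`n = 1`) every empirical distribution is a point mass, so the finite-trials negative entropy is 0 for every `p`, even though the infinite-trials objective singles out `p = 0.5`.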