Many machine learning tasks can be cast as minimizing a convex function of the occupancy measure induced by a policy, with the minimization taken over policies. Examples include reinforcement learning and imitation learning, among others. This more general paradigm is called the Concave Utility Reinforcement Learning (CURL) problem. Because CURL invalidates the classical Bellman equations, it requires new algorithms. We introduce MD-CURL, a new algorithm for CURL in a finite-horizon Markov decision process. MD-CURL is inspired by mirror descent and uses a non-standard regularization to obtain convergence guarantees and a simple closed-form solution, which eliminates the computationally expensive projection steps typically found in mirror descent approaches. We then extend CURL to an online learning scenario and present Greedy MD-CURL, a new method that adapts MD-CURL to an online, episode-based setting with partially unknown dynamics. Like MD-CURL, Greedy MD-CURL has low computational complexity, and it guarantees sub-linear or even logarithmic regret, depending on how much information is available about the underlying dynamics.
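To make the objective concrete, the following is a minimal sketch of how an occupancy measure could be computed for a fixed policy in a finite-horizon MDP, and how two convex losses of it could be evaluated. It is not the MD-CURL update itself, which the abstract does not spell out. The function names (`occupancy_measure`, `linear_loss`, `neg_entropy_loss`), the array layouts, and the toy two-state MDP are all assumptions made for illustration.

```python
import numpy as np

def occupancy_measure(P, pi, mu0):
    """Forward recursion for the state-action occupancy of a policy.

    Assumed layouts (illustrative, not from the paper):
      P[x, a, y]  = probability of moving to state y from state x under action a
      pi[n, x, a] = probability of action a in state x at step n
      mu0[x]      = initial state distribution
    Returns mu with mu[n, x, a] = Pr(state x, action a at step n).
    """
    N, S, A = pi.shape
    mu = np.zeros((N, S, A))
    state_dist = mu0
    for n in range(N):
        # Joint state-action distribution at step n.
        mu[n] = state_dist[:, None] * pi[n]
        # Push it through the dynamics to get the next state distribution.
        state_dist = np.einsum("xa,xay->y", mu[n], P)
    return mu

def linear_loss(mu, reward):
    # Standard RL as a special case: negative expected return, linear in mu.
    return float(-np.sum(mu * reward))

def neg_entropy_loss(mu, eps=1e-12):
    # Exploration-style objective: negative entropy of the state marginals,
    # which is convex in mu. eps guards against log(0).
    rho = mu.sum(axis=2)
    return float(np.sum(rho * np.log(rho + eps)))

# Hypothetical toy MDP: 2 states, 2 actions, horizon 3.
# Action 0 stays in the current state, action 1 switches state.
S, A, N = 2, 2, 3
P = np.zeros((S, A, S))
for x in range(S):
    P[x, 0, x] = 1.0
    P[x, 1, 1 - x] = 1.0

pi = np.full((N, S, A), 1.0 / A)   # uniform policy at every step
mu0 = np.array([1.0, 0.0])         # always start in state 0
reward = np.zeros((N, S, A))
reward[:, 1, :] = 1.0              # reward 1 for being in state 1

mu = occupancy_measure(P, pi, mu0)
rl_value = linear_loss(mu, reward)
entropy_value = neg_entropy_loss(mu)
```

With a linear loss the problem reduces to ordinary reinforcement learning. A nonlinear convex loss such as the negative entropy depends on the whole occupancy measure at once, which gives some intuition for why per-state Bellman recursions no longer apply directly.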