In this paper, we present Score-life programming, a novel theoretical approach for solving reinforcement learning problems. In contrast with classical dynamic programming-based methods, our method can search over non-stationary policy functions and can directly compute optimal infinite-horizon action sequences from a given state. The central idea is the construction of a mapping between infinite-horizon action sequences and real numbers in a bounded interval. This construction enables us to formulate an optimization problem for directly computing optimal infinite-horizon action sequences, without requiring a policy function. We demonstrate the effectiveness of our approach by applying it to nonlinear optimal control problems. Overall, our contributions provide a novel theoretical framework for formulating and solving reinforcement learning problems.
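To illustrate the kind of mapping described above, the following is a minimal sketch, not the construction used in the paper. It assumes a finite action set indexed 0 to M-1 and treats an action sequence as the digits of a base-M expansion, so that every sequence corresponds to a real number in [0, 1]. The function names `encode_actions` and `decode_actions` are hypothetical, and only a finite prefix of the infinite sequence is handled.

```python
from fractions import Fraction
from typing import List, Sequence


def encode_actions(actions: Sequence[int], num_actions: int) -> Fraction:
    """Map a finite action prefix to a number in [0, 1).

    Assumption (not from the paper): action a_k contributes
    a_k / M**(k + 1), i.e. the sequence is read as base-M digits.
    """
    value = Fraction(0)
    for k, action in enumerate(actions):
        if not 0 <= action < num_actions:
            raise ValueError("action index out of range")
        value += Fraction(action, num_actions ** (k + 1))
    return value


def decode_actions(value: Fraction, num_actions: int, horizon: int) -> List[int]:
    """Recover the first `horizon` actions from a number in [0, 1)."""
    actions = []
    for _ in range(horizon):
        value *= num_actions
        digit = int(value)  # integer part is the next action index
        actions.append(digit)
        value -= digit      # keep the fractional part for the next step
    return actions


if __name__ == "__main__":
    x = encode_actions([1, 0, 1], num_actions=2)
    print(x, decode_actions(x, num_actions=2, horizon=3))
```

Exact rational arithmetic (`Fraction`) is used so that the round trip between a sequence prefix and its number is lossless; with floating-point numbers, only the first few dozen actions could be recovered reliably. Under an encoding of this kind, searching for an optimal action sequence from a given state becomes a search over a single real variable in a bounded interval.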