In reinforcement learning, the objective is almost always defined as a \emph{cumulative} function of the rewards along the process. However, many optimal control and reinforcement learning problems, especially in communications and networking, have objectives that are not naturally expressed as sums of rewards. In this paper, we recognize the prevalence of non-cumulative objectives and propose a modification to existing algorithms for optimizing them. Specifically, we focus on the fundamental building block of many optimal control and reinforcement learning algorithms: the Bellman optimality equation. To optimize a non-cumulative objective, we replace the summation in the Bellman update rule with a generalized operation corresponding to the objective. Furthermore, we provide sufficient conditions on the form of the generalized operation, together with assumptions on the Markov decision process, under which the generalized Bellman updates are guaranteed to converge to the global optimum. We demonstrate the idea experimentally with the bottleneck objective, i.e., an objective determined by the minimum reward along the process, on classical optimal control and reinforcement learning tasks, as well as on two network routing problems that maximize flow rates.