A Markov decision process (MDP) can be parameterized by a transition kernel and a reward function. Both play essential roles in the study of reinforcement learning, as evidenced by their presence in the Bellman equations. Motivated by the demands of robotic applications, we investigate various kinds of "costs" associated with reinforcement learning. We find that rewards are central to understanding the structure of a Markov decision process and that reward-centric notions can elucidate important concepts in reinforcement learning. Specifically, we study the sample complexity of policy evaluation and develop a novel estimator with an instance-specific error bound of $\tilde{O}(\sqrt{\frac{\tau_s}{n}})$ for estimating a single state value. In the online regret minimization setting, we refine a transition-based MDP constant, the diameter, into a reward-based constant, the maximum expected hitting cost, and use it to provide a theoretical explanation of how a well-known technique, potential-based reward shaping, can accelerate learning with expert knowledge. To study safe reinforcement learning, we model hazardous environments with irrecoverability and propose a quantitative notion of safe learning via reset efficiency. In this setting, we modify a classic algorithm to account for resets and obtain promising preliminary numerical results. Lastly, for MDPs with multiple reward functions, we develop a computationally efficient planning algorithm that finds Pareto-optimal stochastic policies.