A Markov decision process (MDP) can be parameterized by a transition kernel and a reward function. Both play essential roles in the study of reinforcement learning, as evidenced by their presence in the Bellman equations. In our inquiry into the various kinds of ``costs'' associated with reinforcement learning, inspired by the demands of robotic applications, we find that rewards are central to understanding the structure of an MDP and that reward-centric notions can elucidate important concepts in reinforcement learning. Specifically, we studied the sample complexity of policy evaluation and developed a novel estimator with an instance-specific error bound of $\tilde{O}(\sqrt{\frac{\tau_s}{n}})$ for estimating a single state value. In the online regret minimization setting, we refined a transition-based MDP constant, the diameter, into a reward-based constant, the maximum expected hitting cost, and used it to provide a theoretical explanation of how a well-known technique, potential-based reward shaping, can accelerate learning with expert knowledge. To study safe reinforcement learning, we modeled hazardous environments with irrecoverability and proposed a quantitative notion of safe learning via reset efficiency. In this setting, we modified a classic algorithm to account for resets, achieving promising preliminary numerical results. Lastly, for MDPs with multiple reward functions, we developed a computationally efficient planning algorithm that finds Pareto optimal stochastic policies.