An appropriate reward function is of paramount importance in specifying a task in reinforcement learning (RL). Yet designing a correct reward function is known to be extremely challenging in practice, even for simple tasks. Human-in-the-loop (HiL) RL allows humans to communicate complex goals to the RL agent by providing various types of feedback. However, despite its great empirical success, HiL RL usually requires too much feedback from a human teacher and lacks sufficient theoretical understanding. In this paper, we address these issues from a theoretical perspective, aiming to provide provably feedback-efficient algorithmic frameworks that use a human in the loop to specify the rewards of given tasks. We provide an active-learning-based RL algorithm that first explores the environment without specifying a reward function and then asks a human teacher only a few queries about the rewards of a task at some state-action pairs. After that, the algorithm is guaranteed to return a nearly optimal policy for the task with high probability. We show that, even in the presence of random noise in the feedback, the algorithm needs only $\widetilde{O}(H\dim_{R}^2)$ queries on the reward function to provide an $\epsilon$-optimal policy for any $\epsilon > 0$. Here $H$ is the horizon of the RL environment, and $\dim_{R}$ specifies the complexity of the function class representing the reward function. In contrast, standard RL algorithms must query the reward function for at least $\Omega(\operatorname{poly}(d, 1/\epsilon))$ state-action pairs, where $d$ depends on the complexity of the environmental transition.
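The explore-then-query idea described in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm: it assumes a linear reward class (so the reward at a state-action pair is a dot product with an unknown vector), replaces the reward-free exploration phase with a randomly generated pool of state-action features, and uses a standard uncertainty-based selection rule with ridge regression. The names `select_queries`, `fit_reward`, and `noisy_teacher`, and all numeric settings, are hypothetical.

```python
import numpy as np


def select_queries(features, threshold=0.05, reg=1.0, max_queries=10_000):
    """Greedily pick pool indices whose reward is still uncertain.

    Assumption: rewards are linear in the features, so the uncertainty of a
    point x is x^T A^{-1} x with A = reg * I + sum of queried x x^T.
    Rewards are never observed here; selection depends on features only.
    """
    d = features.shape[1]
    cov_inv = np.eye(d) / reg
    chosen = []
    while len(chosen) < max_queries:
        # Uncertainty x^T A^{-1} x for every point in the pool.
        widths = np.einsum("ij,jk,ik->i", features, cov_inv, features)
        i = int(np.argmax(widths))
        if widths[i] <= threshold:
            break  # every pool point is already well covered
        chosen.append(i)
        x = features[i]
        # Sherman-Morrison rank-one update of A^{-1} after adding x x^T.
        ax = cov_inv @ x
        cov_inv -= np.outer(ax, ax) / (1.0 + x @ ax)
    return chosen, cov_inv


def fit_reward(features, chosen, ask_teacher, reg=1.0):
    """Ridge-regression estimate of the reward vector from teacher answers."""
    X = features[chosen]
    y = np.array([ask_teacher(x) for x in X])
    d = features.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)


rng = np.random.default_rng(0)
d = 5

# Stand-in for the exploration phase: a pool of unit-norm features
# for the state-action pairs the agent has visited (hypothetical data).
pool = rng.normal(size=(2000, d))
pool /= np.linalg.norm(pool, axis=1, keepdims=True)

# Hidden reward vector known only to the simulated teacher.
theta_true = rng.normal(size=d)
theta_true /= np.linalg.norm(theta_true)


def noisy_teacher(x):
    """Simulated human feedback: true reward plus Gaussian noise."""
    return float(x @ theta_true + 0.1 * rng.normal())


chosen, cov_inv = select_queries(pool)
theta_hat = fit_reward(pool, chosen, noisy_teacher)

# Largest reward-prediction error over the whole pool.
max_err = float(np.max(np.abs(pool @ (theta_hat - theta_true))))
print(f"queried {len(chosen)} of {len(pool)} pairs, max error {max_err:.3f}")
```

In this sketch the number of queries is governed by the feature dimension and the uncertainty threshold rather than by the size of the pool, which loosely mirrors the abstract's point that the query count scales with the complexity of the reward class and not with that of the environment.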