The goal of Bayesian inverse reinforcement learning (IRL) is to recover a posterior distribution over reward functions from a set of demonstrations by an expert optimizing a reward unknown to the learner. The resulting posterior over rewards can then be used to synthesize an apprentice policy that performs well on the same or a similar task. A key challenge in Bayesian IRL is bridging the computational gap between the hypothesis space of possible rewards and the likelihood, which is often defined in terms of Q-values: vanilla Bayesian IRL needs to solve the costly forward planning problem (going from rewards to Q-values) at every step of the algorithm, which may mean thousands of times. We propose to address this with a simple change: instead of sampling primarily in the space of rewards, we work primarily in the space of Q-values, since the computation required to go from Q-values to rewards is radically cheaper. Furthermore, this reversal of the computation makes it easy to compute the gradient, allowing efficient sampling using Hamiltonian Monte Carlo. We propose ValueWalk, a new Markov chain Monte Carlo method based on this insight, and illustrate its advantages on several tasks.
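To make the cost asymmetry concrete, the following is a minimal sketch, not the ValueWalk implementation. It assumes a small tabular MDP with known transition probabilities and the Bellman optimality equation with a hard max; the names `reward_from_q`, `q_from_reward`, `P`, and `gamma` are hypothetical and chosen for illustration. Under these assumptions, going from Q-values to rewards is a single matrix product, r(s,a) = Q(s,a) - gamma * E[max_a' Q(s',a')], whereas going from rewards to Q-values requires an iterative planner such as value iteration.

```python
import numpy as np

def reward_from_q(Q, P, gamma):
    """Cheap direction: one inverse Bellman step.

    r(s, a) = Q(s, a) - gamma * sum_s' P(s' | s, a) * max_a' Q(s', a')
    Q has shape (S, A); P has shape (S, A, S).
    """
    V = Q.max(axis=1)             # state values under the greedy policy, shape (S,)
    return Q - gamma * (P @ V)    # (S, A, S) @ (S,) -> (S, A)

def q_from_reward(r, P, gamma, tol=1e-10, max_iters=10_000):
    """Costly direction: forward planning by value iteration (many sweeps)."""
    Q = np.zeros_like(r)
    for _ in range(max_iters):
        Q_new = r + gamma * (P @ Q.max(axis=1))
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new
    return Q

# Illustrative random MDP (assumed sizes, not taken from the paper).
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)   # normalize to valid transition probabilities

Q = rng.normal(size=(S, A))         # a candidate point in Q-space
r = reward_from_q(Q, P, gamma)      # one pass, no planning loop
Q_back = q_from_reward(r, P, gamma) # iterative planning recovers the same Q
```

Because `reward_from_q` consists of a max and a matrix product, it is also straightforward to differentiate with respect to `Q` (almost everywhere), which is the property the abstract relies on for gradient-based samplers such as Hamiltonian Monte Carlo. The round trip through `q_from_reward` is included only to check that the two directions agree.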