Inverse reinforcement learning (IRL) is an imitation learning approach to learning reward functions from expert demonstrations. Its use avoids the difficult and tedious procedure of manual reward specification while retaining the generalization power of reinforcement learning. In IRL, the reward is usually represented as a linear combination of features. In continuous state spaces, the state variables alone are not sufficiently rich to be used as features, but which features are good is not known in general. To address this issue, we propose a method that employs polynomial basis functions to form a candidate set of features, which are shown to allow the matching of statistical moments of state distributions. Feature selection is then performed for the candidates by leveraging the correlation between trajectory probabilities and feature expectations. We demonstrate the approach's effectiveness by recovering reward functions that capture expert policies across non-linear control tasks of increasing complexity. Code, data, and videos are available at https://sites.google.com/view/feature4irl.
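To make the link between polynomial features and moment matching concrete, the following is a minimal sketch, not the released implementation. It assumes undiscounted, time-averaged feature expectations and monomials of the state variables up to a chosen degree. The function names (`polynomial_features`, `feature_expectation`, `rank_features_by_correlation`) and the `trajectory_scores` argument, which stands in for trajectory probabilities or log-probabilities, are hypothetical. The ranking step uses a plain Pearson correlation as an illustrative stand-in for the selection procedure described above.

```python
from itertools import combinations_with_replacement

import numpy as np


def polynomial_features(states, degree):
    """Map states of shape (T, d) to all monomials of total degree 1..degree."""
    states = np.asarray(states, dtype=float)
    n_dims = states.shape[1]
    columns = []
    for deg in range(1, degree + 1):
        # Each index tuple (i, j, ...) selects the monomial s_i * s_j * ...
        for idx in combinations_with_replacement(range(n_dims), deg):
            columns.append(np.prod(states[:, list(idx)], axis=1))
    return np.stack(columns, axis=1)


def feature_expectation(trajectory, degree):
    """Average the polynomial features over the states of one trajectory.

    Assumption: a plain time average without discounting. With monomial
    features, the entries are the empirical raw moments of the visited states.
    """
    return polynomial_features(trajectory, degree).mean(axis=0)


def rank_features_by_correlation(trajectories, trajectory_scores, degree):
    """Rank candidate features by |Pearson correlation| with trajectory scores.

    Assumption: `trajectory_scores` holds one number per trajectory, for
    example an estimated log-probability. This is a hypothetical stand-in
    for the selection criterion, not the authors' exact procedure.
    """
    expectations = np.stack([feature_expectation(t, degree) for t in trajectories])
    scores = np.asarray(trajectory_scores, dtype=float)
    centered_f = expectations - expectations.mean(axis=0)
    centered_s = scores - scores.mean()
    denom = np.linalg.norm(centered_f, axis=0) * np.linalg.norm(centered_s)
    # Features with zero variance get correlation 0 instead of a division error.
    corr = np.divide(centered_f.T @ centered_s, denom,
                     out=np.zeros_like(denom), where=denom > 0)
    order = np.argsort(-np.abs(corr))  # most correlated feature first
    return order, corr
```

For a two-dimensional state and degree 2, the candidate set is `s0, s1, s0^2, s0*s1, s1^2`, so two trajectories with equal feature expectations share the same empirical means and raw second moments. Calling `rank_features_by_correlation(trajectories, trajectory_scores, degree=2)` returns the candidate indices sorted by correlation strength, from which a small subset could be kept as reward features.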