Imitation Learning (IL) is an important paradigm within the broader reinforcement learning (RL) methodology. Unlike most RL methods, it does not assume access to reward feedback. Reward inference and reward shaping are known to be difficult and error-prone, particularly when the demonstration data comes from human experts. Classical methods such as behavioral cloning and inverse reinforcement learning are highly sensitive to estimation errors, a problem that is particularly acute in continuous state-space problems. Meanwhile, state-of-the-art IL algorithms convert behavioral policy learning problems into distribution-matching problems, which often require additional online interaction data to be effective. In this paper, we consider the problem of imitation learning in continuous state-space environments based solely on observed behavior, without access to transition dynamics information, reward structure, or, most importantly, any additional interactions with the environment. Our approach is based on the Markov balance equation and introduces a novel imitation learning framework built on conditional kernel density estimation. It estimates the environment's transition dynamics using conditional kernel density estimators and seeks to satisfy the probabilistic balance equations for the environment. We establish that our estimators satisfy basic asymptotic consistency requirements. Through a series of numerical experiments on continuous-state benchmark environments, we show consistently superior empirical performance over many state-of-the-art IL algorithms.
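To make the idea of estimating transition dynamics with a conditional kernel density estimator concrete, the following is a minimal sketch of a generic ratio-form estimator of p(s' | s, a) built from observed transitions only. It is an illustration under stated assumptions, not the estimator defined in the paper: the Gaussian product kernel, the fixed bandwidths `h_y` and `h_x`, and the function names `gaussian_kernel` and `conditional_kde` are all hypothetical choices made here.

```python
import numpy as np


def gaussian_kernel(u, h):
    """Isotropic Gaussian kernel with bandwidth h, evaluated on each row of u.

    Assumption: one shared bandwidth per block of variables (illustrative only).
    """
    u = np.atleast_2d(u)
    d = u.shape[1]
    norm = (2.0 * np.pi * h ** 2) ** (-d / 2.0)
    return norm * np.exp(-0.5 * np.sum(u ** 2, axis=1) / h ** 2)


def conditional_kde(y, x, Y, X, h_y=0.3, h_x=0.3):
    """Estimate the conditional density p(y | x) from paired samples.

    X: (n, d_x) array of conditioning points, e.g. stacked (state, action).
    Y: (n, d_y) array of responses, e.g. next states.
    x: (d_x,) query condition; y: (d_y,) query response.
    Uses the ratio form sum_i K(x - X_i) K(y - Y_i) / sum_i K(x - X_i).
    """
    wx = gaussian_kernel(X - x, h_x)  # weight of each sample given the condition
    ky = gaussian_kernel(Y - y, h_y)  # kernel mass each sample places at y
    denom = wx.sum()
    if denom == 0.0:
        # No sample carries weight near x; return zero density by convention.
        return 0.0
    return float(np.dot(wx, ky) / denom)


if __name__ == "__main__":
    # Synthetic 1-D dynamics, assumed purely for the demo: s' = 0.9 s + a + noise.
    rng = np.random.default_rng(0)
    S = rng.uniform(-1.0, 1.0, size=(500, 1))
    A = rng.uniform(-1.0, 1.0, size=(500, 1))
    S_next = 0.9 * S + A + 0.1 * rng.standard_normal((500, 1))
    X = np.hstack([S, A])

    query = np.array([0.2, 0.5])  # (state, action) pair to condition on
    for s_next in (0.0, 0.68, 1.5):
        p = conditional_kde(np.array([s_next]), query, S_next, X)
        print(f"estimated p(s'={s_next} | s=0.2, a=0.5) = {p:.4f}")
```

In this sketch the estimate is a weighted mixture of kernels centered at the observed next states, so it integrates to one over s' for any query with nonzero total weight. In practice, the bandwidths would need to shrink with the sample size for consistency arguments of the kind mentioned in the abstract to apply; the fixed values here are for illustration only.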