We present ILARL, a new algorithm for imitation learning in infinite-horizon linear MDPs, which greatly improves the bound on the number of trajectories that the learner needs to sample from the environment. In particular, we remove the exploration assumptions required in previous works and improve the dependence on the desired accuracy $\epsilon$ from $\mathcal{O}\br{\epsilon^{-5}}$ to $\mathcal{O}\br{\epsilon^{-4}}$. Our result relies on a connection between imitation learning and online learning in MDPs with adversarial losses. For the latter setting, we present the first result for infinite-horizon linear MDPs, which may be of independent interest. Moreover, we provide a strengthened result for the finite-horizon case, where we achieve $\mathcal{O}\br{\epsilon^{-2}}$. Numerical experiments with linear function approximation show that ILARL outperforms other commonly used algorithms.