This work develops new algorithms with rigorous efficiency guarantees for infinite-horizon imitation learning (IL) with linear function approximation, without restrictive coherence assumptions. We begin with the minimax formulation of the problem and then outline how to leverage classical tools from optimization, in particular the proximal-point method (PPM) and dual smoothing, for online and offline IL, respectively. Thanks to PPM, we avoid the nested policy evaluation and cost updates for online IL that appear in the prior literature. In particular, we replace the conventional alternating updates with the optimization of a single convex and smooth objective over both cost and Q-functions. When this objective is solved inexactly, we relate the optimization errors to the suboptimality of the recovered policy. As an added bonus, by reinterpreting PPM as dual smoothing with the expert policy as the center point, we also obtain an offline IL algorithm with theoretical guarantees in terms of the number of required expert trajectories. Finally, we achieve convincing empirical performance for both linear and neural network function approximation.
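For readers unfamiliar with the proximal-point method named in the abstract, the following is a minimal generic sketch of the iteration x_{k+1} = argmin_x f(x) + (1/(2η))‖x − x_k‖² on a toy convex quadratic. It is not the paper's IL algorithm: the objective, the function name `proximal_point`, the step size `eta`, and the test matrices are all assumptions chosen only so that each proximal step has a closed form.

```python
import numpy as np

def proximal_point(A, b, x0, eta=1.0, num_steps=200):
    """Generic proximal-point iterations for the toy objective
    f(x) = 0.5 * x^T A x - b^T x, with A symmetric positive definite.

    Each step solves argmin_x f(x) + (1 / (2 * eta)) * ||x - x_k||^2,
    whose optimality condition is (A + I / eta) x = b + x_k / eta.
    """
    x = np.asarray(x0, dtype=float)
    n = x.shape[0]
    M = A + np.eye(n) / eta  # system matrix of the regularized subproblem
    for _ in range(num_steps):
        x = np.linalg.solve(M, b + x / eta)  # exact proximal step
    return x

# Hypothetical toy data, for illustration only.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

x_ppm = proximal_point(A, b, x0=np.zeros(2))
x_star = np.linalg.solve(A, b)  # unregularized minimizer, for comparison
print(np.allclose(x_ppm, x_star))
```

In this toy setting the iterates contract toward the minimizer of f at a rate governed by `eta` and the smallest eigenvalue of `A`; the abstract's contribution concerns how such proximal steps, solved inexactly, translate into policy suboptimality, which this sketch does not model.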