Learning to control unknown nonlinear dynamical systems is a fundamental problem in reinforcement learning and control theory. A common approach is to first explore the environment (exploration), learn an accurate model of it (system identification), and then compute an optimal controller with the minimum cost on this estimated system (policy optimization). While existing work has shown that it is possible to learn a uniformly good model of the system~\citep{mania2020active}, in practice, if we aim to learn a controller with low cost on the actual system, certain system parameters may be significantly more critical than others, and exploration should therefore focus on learning those parameters. In this work, we consider the setting of nonlinear dynamical systems and seek to formally quantify (a) which parameters are most relevant to learning a good controller, and (b) how best to explore so as to minimize uncertainty in those parameters. Inspired by recent work on linear systems~\citep{wagenmaker2021task}, we show that minimizing the controller loss in nonlinear systems translates to estimating the system parameters in a particular, task-dependent metric. Motivated by this, we develop an algorithm that efficiently explores the system to reduce uncertainty in this metric, and we prove a lower bound showing that our approach learns a controller at a near-instance-optimal rate. Our algorithm relies on a general reduction from policy optimization to optimal experiment design in arbitrary systems, which may be of independent interest. We conclude with experiments demonstrating the effectiveness of our method on realistic nonlinear robotic systems.
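To make the idea of a task-dependent metric concrete, the following is a minimal sketch under a deliberately simplified, hypothetical model; it is not the algorithm described here. It assumes that near the true parameters $\theta_\star$ the excess controller cost is approximately the quadratic form $\frac{1}{2}(\hat\theta-\theta_\star)^\top H(\hat\theta-\theta_\star)$ (a second-order Taylor expansion, whose first-order term vanishes because the cost is minimized at $\theta_\star$), with a diagonal weight matrix $H = \mathrm{diag}(h)$ with every $h_i > 0$, and that spending $n_i$ samples on parameter $i$ yields an estimate with variance $\sigma^2/n_i$. Under these assumptions the expected excess cost is $\frac{\sigma^2}{2}\sum_i h_i/n_i$; minimizing it under a fixed budget $\sum_i n_i = N$ gives $n_i \propto \sqrt{h_i}$, whereas task-agnostic exploration spreads samples uniformly. The weights `h`, the budget, and all function names are illustrative assumptions.

```python
import math


def expected_excess_loss(h, alloc, noise_var=1.0):
    """Quadratic approximation 0.5 * sum_i h_i * Var(theta_hat_i), where
    Var(theta_hat_i) = noise_var / n_i for n_i samples spent on parameter i.
    Assumes a diagonal task metric H = diag(h) (hypothetical toy model)."""
    return 0.5 * sum(hi * noise_var / ni for hi, ni in zip(h, alloc))


def uniform_allocation(h, budget):
    """Task-agnostic exploration: split the sample budget evenly."""
    return [budget / len(h)] * len(h)


def task_aware_allocation(h, budget):
    """Minimize sum_i h_i / n_i subject to sum_i n_i = budget.
    Setting the Lagrangian derivative to zero gives n_i proportional to
    sqrt(h_i). Requires every h_i > 0; fractional n_i are allowed
    (continuous relaxation of the experiment-design problem)."""
    roots = [math.sqrt(hi) for hi in h]
    total = sum(roots)
    return [budget * r / total for r in roots]


# Hypothetical task: the first parameter matters 100x more to the controller's cost.
h = [100.0, 1.0, 1.0, 1.0]
budget = 120
uniform = expected_excess_loss(h, uniform_allocation(h, budget))
task_aware = expected_excess_loss(h, task_aware_allocation(h, budget))
print(f"uniform:    {uniform:.4f}")     # uniform:    1.7167
print(f"task-aware: {task_aware:.4f}")  # task-aware: 0.7042
```

With these made-up weights, the task-aware allocation spends about 92 of the 120 samples on the first parameter and more than halves the expected excess cost relative to uniform exploration, even though it estimates the other three parameters less accurately.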