Reinforcement learning algorithms commonly seek to optimize policies for solving one particular task. How should we explore an unknown dynamical system such that the estimated model globally approximates the dynamics and allows us to solve multiple downstream tasks in a zero-shot manner? In this paper, we address this challenge by developing OPAX, an algorithm for active exploration. OPAX uses well-calibrated probabilistic models to quantify the epistemic uncertainty about the unknown dynamics. It maximizes the information gain between the unknown dynamics and state observations, and does so optimistically with respect to the set of plausible dynamics. We show how the resulting optimization problem can be reduced to an optimal control problem that can be solved at each episode using standard approaches. We analyze our algorithm for general models and, in the case of Gaussian process dynamics, give a first-of-its-kind sample complexity bound and show that the epistemic uncertainty converges to zero. In our experiments, we compare OPAX with heuristic active exploration approaches on several environments. The results show that OPAX is not only theoretically sound but also performs well for zero-shot planning on novel downstream tasks.
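To make the information-gain idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes a Gaussian model with per-dimension epistemic standard deviations and homoscedastic Gaussian observation noise, for which the mutual information between a latent value and its noisy observation is 0.5 * log(1 + sigma^2 / noise^2). The function names `info_gain_bonus` and `trajectory_objective` and the array layout are hypothetical and chosen only for illustration.

```python
import numpy as np


def info_gain_bonus(epistemic_std, noise_std):
    """Per-step exploration bonus under an assumed Gaussian model.

    epistemic_std: array of shape (..., state_dim) with the model's
        epistemic standard deviation for each predicted state dimension.
    noise_std: positive scalar, assumed observation-noise standard deviation.

    Returns 0.5 * sum_j log(1 + sigma_j^2 / noise_std^2), summed over the
    last axis (the state dimensions).
    """
    epistemic_std = np.asarray(epistemic_std, dtype=float)
    ratio = (epistemic_std / noise_std) ** 2
    # log1p is numerically stable when the ratio is close to zero.
    return 0.5 * np.sum(np.log1p(ratio), axis=-1)


def trajectory_objective(epistemic_stds, noise_std):
    """Sum of per-step bonuses along one trajectory (illustrative only).

    epistemic_stds: array of shape (horizon, state_dim).
    """
    return float(np.sum(info_gain_bonus(epistemic_stds, noise_std)))
```

Under these assumptions, a planner would prefer trajectories with a larger `trajectory_objective`, and the bonus shrinks to zero as the epistemic uncertainty vanishes, which is consistent with the convergence claim in the abstract.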