Reinforcement learning algorithms commonly seek to optimize policies for solving one particular task. How should we explore an unknown dynamical system such that the estimated model allows us to solve multiple downstream tasks in a zero-shot manner? In this paper, we address this challenge by developing OPAX, an algorithm for active exploration. OPAX uses well-calibrated probabilistic models to quantify the epistemic uncertainty about the unknown dynamics. It optimistically (with respect to plausible dynamics) maximizes the information gain between the unknown dynamics and state observations. We show how the resulting optimization problem can be reduced to an optimal control problem that can be solved at each episode using standard approaches. We analyze our algorithm for general models and, in the case of Gaussian process dynamics, give a sample complexity bound and show that the epistemic uncertainty converges to zero. In our experiments, we compare OPAX with heuristic active exploration approaches on several environments. Our experiments show that OPAX is not only theoretically sound but also performs well for zero-shot planning on novel downstream tasks.
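To make the idea of exploring by information gain concrete, the following is a minimal sketch and not the OPAX algorithm itself. It assumes a one-dimensional input space, a Gaussian process with a unit-variance squared-exponential kernel, and Gaussian observation noise, and it uses the quantity 0.5 * log(1 + sigma^2 / noise^2) as the information gain of a single observation. All function names and parameter values are hypothetical. The sketch greedily queries the most informative candidate input; it omits both the optimism over plausible dynamics and the trajectory-level optimal control problem that the abstract describes.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0):
    """Unit-variance squared-exponential kernel between two sets of 1-D inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)


def posterior_variance(x_train, x_query, noise_std=0.1, lengthscale=1.0):
    """GP posterior (epistemic) variance at x_query given training inputs.

    The posterior variance of a GP depends only on where data was collected,
    not on the observed targets, so no targets are needed here.
    """
    k_tt = rbf_kernel(x_train, x_train, lengthscale)
    k_tt = k_tt + noise_std**2 * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train, lengthscale)
    solved = np.linalg.solve(k_tt, k_qt.T)  # shape (n_train, n_query)
    var = 1.0 - np.sum(k_qt * solved.T, axis=1)  # prior variance is 1
    return np.clip(var, 0.0, None)  # guard against tiny negative values


def information_gain(var, noise_std=0.1):
    """Information gain of one noisy Gaussian observation with prior variance var."""
    return 0.5 * np.log1p(var / noise_std**2)


def explore(n_rounds=10, noise_std=0.1):
    """Greedily query the candidate input with the largest information gain."""
    candidates = np.linspace(-3.0, 3.0, 61)  # hypothetical input grid
    x_train = np.array([0.0])  # hypothetical initial data point
    max_gain = []
    for _ in range(n_rounds):
        var = posterior_variance(x_train, candidates, noise_std)
        gain = information_gain(var, noise_std)
        max_gain.append(float(gain.max()))
        x_train = np.append(x_train, candidates[np.argmax(gain)])
    return x_train, max_gain


if __name__ == "__main__":
    inputs, gains = explore()
    print("queried inputs:", np.round(inputs, 2))
    print("largest remaining information gain per round:", np.round(gains, 3))
```

In this toy setting the largest remaining information gain shrinks as data accumulates, which loosely mirrors the abstract's claim that the epistemic uncertainty converges to zero in the Gaussian process case.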