Some Supervision Required: Incorporating Oracle Policies in Reinforcement Learning via Epistemic Uncertainty Metrics

Reinforcement learning typically explores an environment through random actions, a large portion of which can be unproductive. Exploration can instead be improved by initializing the learning policy with an existing (previously learned or hard-coded) oracle policy, offline data, or demonstrations. When an oracle policy is used, however, it can be unclear how best to incorporate its experience into the learning policy so as to maximize sample efficiency. In this paper, we propose Critic Confidence Guided Exploration (CCGE), a method for incorporating such an oracle policy into standard actor-critic reinforcement learning algorithms. Specifically, CCGE treats the oracle policy's actions as suggestions: it incorporates them into the learning scheme when uncertainty is high and ignores them when uncertainty is low. CCGE is agnostic to the method used to estimate uncertainty, and we show that it is equally effective with two different techniques. Empirically, we evaluate CCGE on various benchmark reinforcement learning tasks and show that it can improve both sample efficiency and final performance. Furthermore, on sparse-reward environments, CCGE performs competitively against adjacent algorithms that also leverage an oracle policy. Our experiments show that uncertainty can serve as a heuristic for guiding exploration with an oracle in reinforcement learning. We expect this to inspire further research in this direction, in which various heuristics are used to determine the direction of guidance provided to learning.
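The gating idea in the abstract (follow the oracle's suggestion when uncertainty is high, ignore it when uncertainty is low) can be illustrated with a minimal Python sketch. This is not the paper's implementation. Everything here is an assumption made for illustration: uncertainty is approximated by the disagreement of a small critic ensemble, it is evaluated at the learner's own proposed action, and a fixed `threshold` decides when the oracle takes over. The names `epistemic_uncertainty` and `select_action` are hypothetical.

```python
import statistics
from typing import Callable, Sequence, Tuple

State = float
Action = float
Policy = Callable[[State], Action]
Critic = Callable[[State, Action], float]


def epistemic_uncertainty(critics: Sequence[Critic], state: State, action: Action) -> float:
    """Assumed uncertainty proxy: population std. dev. of the ensemble's Q-estimates."""
    q_values = [q(state, action) for q in critics]
    return statistics.pstdev(q_values)


def select_action(
    state: State,
    learner: Policy,
    oracle: Policy,
    critics: Sequence[Critic],
    threshold: float,
) -> Tuple[Action, bool]:
    """Return (action, used_oracle).

    Assumption: the oracle's suggestion is taken only when the critics
    disagree about the learner's own action by more than `threshold`.
    """
    proposed = learner(state)
    if epistemic_uncertainty(critics, state, proposed) > threshold:
        return oracle(state), True
    return proposed, False


# Toy policies and critic ensembles, purely for demonstration.
learner = lambda s: 0.0
oracle = lambda s: 1.0
disagreeing = [lambda s, a: 0.0, lambda s, a: 2.0]  # std. dev. of 1.0
agreeing = [lambda s, a: 1.0, lambda s, a: 1.0]     # std. dev. of 0.0

high = select_action(0.0, learner, oracle, disagreeing, threshold=0.5)
low = select_action(0.0, learner, oracle, agreeing, threshold=0.5)
```

With the disagreeing ensemble the oracle's action is chosen, and with the agreeing ensemble the learner's own action is kept. Any other uncertainty estimator could replace `epistemic_uncertainty` without changing `select_action`, which mirrors the abstract's statement that the approach is agnostic to how uncertainty is estimated.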
