Directed exploration is a crucial challenge in reinforcement learning (RL), especially when rewards are sparse. Information-directed sampling (IDS), which optimizes the information ratio, addresses this challenge by augmenting regret with information gain. However, estimating information gain is either computationally intractable or relies on restrictive assumptions, which prohibits its use in many practical instances. In this work, we posit an alternative exploration incentive in terms of the integral probability metric (IPM) between the current estimate of the transition model and the unknown optimal model, which, under suitable conditions, can be computed in closed form with the kernelized Stein discrepancy (KSD). Based on KSD, we develop a novel algorithm, STEERING: \textbf{STE}in information dir\textbf{E}cted exploration for model-based \textbf{R}einforcement Learn\textbf{ING}. To enable its derivation, we develop fundamentally new variants of KSD for discrete conditional distributions. We further establish that STEERING achieves sublinear Bayesian regret, improving upon prior learning rates of information-augmented MBRL, IDS included. Experimentally, we show that the proposed algorithm is computationally affordable and outperforms several prior approaches.