Directed exploration is a crucial challenge in reinforcement learning (RL), especially when rewards are sparse. Information-directed sampling (IDS), which optimizes the information ratio, addresses this challenge by augmenting regret with information gain. However, estimating information gain is either computationally intractable or relies on restrictive assumptions, which prohibits its use in many practical instances. In this work, we posit an alternative exploration incentive in terms of the integral probability metric (IPM) between a current estimate of the transition model and the unknown optimal model, which, under suitable conditions, can be computed in closed form with the kernelized Stein discrepancy (KSD). Based on KSD, we develop a novel algorithm, \algo: \textbf{STE}in information dir\textbf{E}cted exploration for model-based \textbf{R}einforcement Learn\textbf{ING}. To enable its derivation, we develop fundamentally new variants of KSD for discrete conditional distributions. We further establish that {\algo} achieves sublinear Bayesian regret, improving upon prior learning rates of information-augmented model-based RL (MBRL). Experimentally, we show that the proposed algorithm is computationally affordable and outperforms several prior approaches.