We study the problem of reinforcement learning in infinite-horizon discounted linear Markov decision processes (MDPs), and propose the first computationally efficient algorithm achieving rate-optimal regret guarantees in this setting. Our main idea is to combine two classic techniques for optimistic exploration: additive exploration bonuses applied to the reward function, and artificial transitions made to an absorbing state with maximal return. We show that, combined with a regularized approximate dynamic-programming scheme, the resulting algorithm achieves a regret of order $\tilde{\mathcal{O}}\big(\sqrt{d^3 (1-\gamma)^{-7/2} T}\big)$, where $T$ is the total number of sample transitions, $\gamma \in (0,1)$ is the discount factor, and $d$ is the feature dimensionality. The results continue to hold against adversarial reward sequences, enabling application of our method to the problem of imitation learning in linear MDPs, where we achieve state-of-the-art results.
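To make the high-level recipe concrete, the following is a minimal sketch of how the two optimism devices named above can coexist in a single Bellman backup. It is an illustration under standard assumptions (an elliptical confidence width as the bonus, linear reward and value estimates), not the paper's actual algorithm; the function name and all parameters are hypothetical.

```python
import numpy as np

def optimistic_backup(phi, theta_r, Lambda_inv, w_next, beta, gamma):
    """One illustrative optimistic Bellman backup for a state-action feature phi.

    phi        : (d,) feature vector of the state-action pair
    theta_r    : (d,) least-squares estimate of the reward parameter
    Lambda_inv : (d, d) inverse of the regularized feature covariance matrix
    w_next     : (d,) parameter so that phi @ w_next estimates E[V(s')]
    beta       : bonus scale (grows with d and the confidence level in typical analyses)
    gamma      : discount factor in (0, 1)
    """
    v_max = 1.0 / (1.0 - gamma)              # return of the absorbing "maximal-reward" state
    width = np.sqrt(phi @ Lambda_inv @ phi)  # elliptical confidence width at (s, a)
    r_opt = phi @ theta_r + beta * width     # device 1: additive bonus on the reward
    p_abs = min(1.0, beta * width)           # device 2: uncertainty mass routed to the
    v_next = phi @ w_next                    #           absorbing state with value v_max
    q = r_opt + gamma * ((1.0 - p_abs) * v_next + p_abs * v_max)
    return min(q, v_max)                     # optimistic values never exceed the max return
```

The final `min` caps the optimistic value at the maximal attainable return, mirroring the role of the absorbing state; the regularized approximate dynamic-programming scheme of the paper would apply such backups iteratively rather than in this one-shot form.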