We resolve the open problem of designing a computationally efficient algorithm for infinite-horizon average-reward linear Markov Decision Processes (MDPs) with $\widetilde{O}(\sqrt{T})$ regret. Previous approaches achieving $\widetilde{O}(\sqrt{T})$ regret are either computationally inefficient or require strong assumptions on the dynamics, such as ergodicity. In this paper, we approximate the average-reward setting by the discounted setting and show that running an optimistic value-iteration-based algorithm on the discounted setting achieves $\widetilde{O}(\sqrt{T})$ regret when the discount factor $\gamma$ is tuned appropriately. The main challenge in this approximation approach is obtaining a regret bound with a sharp dependence on the effective horizon $1 / (1 - \gamma)$. We use a computationally efficient clipping operator that constrains the span of the optimistic state value function estimate, which yields a regret bound that is sharp in the effective horizon and, in turn, $\widetilde{O}(\sqrt{T})$ regret.
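To make the span-constraining idea concrete, the following is a minimal sketch of one plausible form of such a clipping operator. It assumes the operator caps each value estimate at the minimum estimate plus a span bound; the function name `clip_span`, the parameter `span_bound`, and the finite array of state values are illustrative assumptions, not details given in the text above.

```python
import numpy as np


def clip_span(values: np.ndarray, span_bound: float) -> np.ndarray:
    """Cap value estimates so that max(values) - min(values) <= span_bound.

    Hypothetical illustration: entries above min(values) + span_bound are
    lowered to that ceiling; all other entries are left unchanged.
    """
    floor = values.min()
    return np.minimum(values, floor + span_bound)


# Example: optimistic estimates over four states, with an assumed span bound of 4.
v_tilde = np.array([0.0, 2.0, 5.0, 9.0])
v_clipped = clip_span(v_tilde, 4.0)
print(v_clipped)
```

The operation costs a single pass over the evaluated states, which is consistent with the claim that the clipping is computationally efficient. How the span bound relates to $\gamma$ and $T$ is not specified here and is left as a free parameter in the sketch.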