We study regret minimization in online episodic linear Markov Decision Processes and obtain rate-optimal $\widetilde O (\sqrt K)$ regret, where $K$ denotes the number of episodes. Our work is the first to establish the optimal (w.r.t.~$K$) rate of convergence in the stochastic setting with bandit feedback using a policy-optimization-based approach, and the first to establish the optimal (w.r.t.~$K$) rate in the adversarial setup with full-information feedback, for which no algorithm with an optimal rate guarantee was previously known.