We study reinforcement learning with linear function approximation and adversarially changing cost functions, a setup that has mostly been considered under simplifying assumptions such as full-information feedback or exploratory conditions. We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback, featuring a combination of mirror descent and least-squares policy evaluation in an auxiliary MDP used to compute exploration bonuses. Our algorithm obtains an $\widetilde O(K^{6/7})$ regret bound, improving significantly over the previous state of the art of $\widetilde O(K^{14/15})$ in this setting. In addition, we present a version of the same algorithm under the assumption that a simulator of the environment is available to the learner (but otherwise no exploratory assumptions are made), and prove it obtains state-of-the-art regret of $\widetilde O(K^{2/3})$.
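To make the mirror-descent ingredient concrete, the following is a minimal sketch of a generic exponential-weights (KL mirror descent) policy update at a single state, driven by a linear cost estimate. It is not the algorithm of this work: the exploration bonuses, the auxiliary MDP, and the least-squares evaluation step are all omitted, and the function name, the feature vectors, the weight vector, and the step size `eta` are hypothetical choices made only for illustration.

```python
import math


def mirror_descent_step(policy, features, weights, eta):
    """One exponential-weights (KL mirror descent) update at a single state.

    policy:   list of action probabilities pi_k(a | s)
    features: list of feature vectors phi(s, a), one per action
    weights:  estimated cost parameters w_k, so Q_k(s, a) ~ phi(s, a) . w_k
    eta:      step size (illustrative value, not taken from the source)
    """
    # Linear estimate of the cost-to-go of each action.
    q_values = [sum(f * w for f, w in zip(phi, weights)) for phi in features]
    # Shift by the minimum cost before exponentiating, for numerical stability;
    # the shift cancels in the normalization below.
    q_min = min(q_values)
    unnormalized = [
        p * math.exp(-eta * (q - q_min)) for p, q in zip(policy, q_values)
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Two actions with one-hot features; action 0 has estimated cost 1, action 1 cost 0.
policy = [0.5, 0.5]
features = [[1.0, 0.0], [0.0, 1.0]]
weights = [1.0, 0.0]
new_policy = mirror_descent_step(policy, features, weights, eta=0.5)
```

Because costs are minimized, the update shifts probability mass toward the action with the lower estimated cost, while the multiplicative form keeps the new policy close to the previous one.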