We study computationally and statistically efficient reinforcement learning under the linear $Q^\pi$ realizability assumption, where every policy's $Q$-function is linear in a given state-action feature representation. Prior methods in this setting are either computationally intractable or require (local) access to a simulator. In this paper, we propose a computationally efficient online RL algorithm, named Frozen Policy Iteration, for the linear $Q^\pi$ realizability setting, which works for Markov Decision Processes (MDPs) with stochastic initial states, stochastic rewards, and deterministic transitions. Our algorithm achieves a regret bound of $\widetilde{O}(\sqrt{d^2H^6T})$, where $d$ is the dimensionality of the feature space, $H$ is the horizon length, and $T$ is the total number of episodes. Our regret bound is optimal for linear (contextual) bandits, which are a special case of our setting with $H = 1$. Existing policy iteration algorithms under the same setting rely heavily on repeatedly sampling the same state via simulator access, which is not implementable in the online setting with stochastic initial states studied in this paper. In contrast, our new algorithm circumvents this limitation by strategically using only the high-confidence portion of the trajectory data and freezing the policy for well-explored states, which ensures that all data used by our algorithm remains effectively on-policy throughout the course of learning. We further demonstrate the versatility of our approach by extending it to the Uniform-PAC setting and to function classes with bounded eluder dimension.