We consider online reinforcement learning (RL) in episodic Markov decision processes (MDPs) under the linear $q^\pi$-realizability assumption, which states that the action-values of all policies can be expressed as linear functions of state-action features. This class is known to be more general than linear MDPs, where the transition kernel and the reward function are assumed to be linear functions of the feature vectors. As our first contribution, we show that the difference between the two classes is the presence of states in linearly $q^\pi$-realizable MDPs where, for any policy, all the actions have approximately equal values. Skipping over these states by following an arbitrary fixed policy in them transforms the problem into a linear MDP. Based on this observation, we derive a novel (computationally inefficient) learning algorithm for linearly $q^\pi$-realizable MDPs that simultaneously learns which states should be skipped over and runs another learning algorithm on the linear MDP hidden in the problem. The method returns an $\epsilon$-optimal policy after $\text{polylog}(H, d)/\epsilon^2$ interactions with the MDP, where $H$ is the time horizon and $d$ is the dimension of the feature vectors, giving the first polynomial-sample-complexity online RL algorithm for this setting. The results are proved for the misspecified case, where the sample complexity is shown to degrade gracefully with the misspecification error.
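For concreteness, the two assumptions compared above can be sketched as follows. The notation is assumed for illustration and is not fixed by the abstract: $\varphi(s,a) \in \mathbb{R}^d$ denotes the feature map, $\theta_h^\pi$, $\theta_h$ and $\mu_h$ are unknown parameters, $h \in \{1, \dots, H\}$ indexes the stage, and $\eta \ge 0$ is the misspecification error. Linear $q^\pi$-realizability asks that, for every policy $\pi$ and stage $h$, there is some $\theta_h^\pi \in \mathbb{R}^d$ with

$$
\sup_{s,a} \left| q_h^\pi(s,a) - \langle \varphi(s,a), \theta_h^\pi \rangle \right| \le \eta ,
$$

whereas a linear MDP imposes the stronger structural condition that the rewards and transitions themselves are linear in the features:

$$
r_h(s,a) = \langle \varphi(s,a), \theta_h \rangle , \qquad P_h(\cdot \mid s,a) = \langle \varphi(s,a), \mu_h(\cdot) \rangle .
$$

The second condition implies the first with $\eta = 0$, which is one way to see why the $q^\pi$-realizable class is the more general one.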