We consider online reinforcement learning (RL) in episodic Markov decision processes (MDPs) under the linear $q^\pi$-realizability assumption, which requires that the action-values of all policies can be expressed as linear functions of state-action features. This class is known to be more general than linear MDPs, where the transition kernel and the reward function are assumed to be linear functions of the feature vectors. As our first contribution, we show that the difference between the two classes is the presence of states in linearly $q^\pi$-realizable MDPs where, for any policy, all actions have approximately equal values; skipping over these states by following an arbitrary fixed policy in them transforms the problem into a linear MDP. Based on this observation, we derive a novel (computationally inefficient) learning algorithm for linearly $q^\pi$-realizable MDPs that simultaneously learns which states should be skipped over and runs another learning algorithm on the linear MDP hidden in the problem. The method returns an $\epsilon$-optimal policy after $\text{polylog}(H, d)/\epsilon^2$ interactions with the MDP, where $H$ is the time horizon and $d$ is the dimension of the feature vectors, giving the first polynomial-sample-complexity online RL algorithm for this setting. The results are proved for the misspecified case, where the sample complexity is shown to degrade gracefully with the misspecification error.
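For concreteness, the two assumptions contrasted above can be sketched as follows. The notation is an assumption introduced here for illustration and is not taken from the abstract: $\phi(s,a) \in \mathbb{R}^d$ denotes the feature vector of a state-action pair, $q_h^\pi$ the action-value function of policy $\pi$ at stage $h \in \{1, \dots, H\}$, $\theta_h^\pi$, $\psi_h$ and $\mu_h$ unknown parameters, and $\eta \ge 0$ the misspecification error.

$$
\text{(linear } q^\pi\text{-realizability)} \qquad \forall \pi,\ \forall h\ \ \exists\, \theta_h^\pi \in \mathbb{R}^d: \quad \bigl| q_h^\pi(s,a) - \langle \phi(s,a), \theta_h^\pi \rangle \bigr| \le \eta \quad \text{for all } (s,a),
$$

$$
\text{(linear MDP)} \qquad r_h(s,a) = \langle \phi(s,a), \psi_h \rangle, \qquad P_h(\cdot \mid s,a) = \langle \phi(s,a), \mu_h(\cdot) \rangle .
$$

Under the linear MDP conditions, $q_h^\pi(s,a) = r_h(s,a) + \int v_{h+1}^\pi(s')\, P_h(ds' \mid s,a)$ is itself linear in $\phi(s,a)$ for every policy $\pi$, which is one way to see that linear MDPs form a special case of the first class (with $\eta = 0$).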