We study reinforcement learning (RL) with linear function approximation in Markov Decision Processes (MDPs) satisfying \emph{linear Bellman completeness} -- a fundamental setting where the Bellman backup of any linear value function remains linear. Although this setting is statistically tractable, prior computationally efficient algorithms are either limited to small action spaces or require strong oracle assumptions over the feature space. We provide a computationally efficient algorithm for linear Bellman complete MDPs with \emph{deterministic transitions}, stochastic initial states, and stochastic rewards. For finite action spaces, our algorithm is end-to-end efficient; for large or infinite action spaces, we require only a standard argmax oracle over actions. Our algorithm learns an $\varepsilon$-optimal policy with sample and computational complexity polynomial in the horizon, feature dimension, and $1/\varepsilon$.
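For concreteness, the completeness condition can be sketched in one common finite-horizon formulation. The notation here is an assumption and does not appear in the text above: $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ is a feature map, $H$ is the horizon, $r_h$ is the (possibly random) reward at step $h$, and $P_h$ is the transition kernel. Under these assumptions, linear Bellman completeness asks that for every step $h \in \{1, \dots, H\}$ and every $\theta \in \mathbb{R}^d$ there exist some $\theta' \in \mathbb{R}^d$ such that, for all $(s, a) \in \mathcal{S} \times \mathcal{A}$,
\[
  \langle \phi(s,a), \theta' \rangle
  \;=\;
  \mathbb{E}\bigl[ r_h(s,a) \bigr]
  + \mathbb{E}_{s' \sim P_h(\cdot \mid s,a)}
    \Bigl[ \max_{a' \in \mathcal{A}} \langle \phi(s',a'), \theta \rangle \Bigr].
\]
With deterministic transitions, the inner expectation collapses to a single next state; writing $s' = T_h(s,a)$ for a hypothetical transition map $T_h$, the right-hand side becomes $\mathbb{E}[r_h(s,a)] + \max_{a' \in \mathcal{A}} \langle \phi(T_h(s,a), a'), \theta \rangle$. The maximization over $a'$ is the step at which an argmax oracle over actions would plausibly be invoked when $\mathcal{A}$ is large or infinite.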