Designing sample-efficient and computationally feasible reinforcement learning (RL) algorithms is particularly challenging in environments with large or infinite state and action spaces. In this paper, we advance this effort by presenting an efficient algorithm for Markov Decision Processes (MDPs) where the state-action value function of any policy is linear in a given feature map. This challenging setting can model environments with infinite states and actions, strictly generalizes classic linear MDPs, and currently lacks a computationally efficient algorithm under online access to the MDP. Specifically, we introduce a new RL algorithm that efficiently finds a near-optimal policy in this setting, using a number of episodes and calls to a cost-sensitive classification (CSC) oracle that are both polynomial in the problem parameters. Notably, our CSC oracle can be efficiently implemented when the feature dimension is constant, representing a clear improvement over state-of-the-art methods, which require solving non-convex problems with horizon-many variables and can incur computational costs that are exponential in the horizon.
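To fix notation for the realizability assumption (the symbols $\phi$, $w_\pi$, and $d$ below are generic illustrative choices, not necessarily those used in the paper), linearity of the state-action value function means that for every policy $\pi$ there exists a weight vector $w_\pi \in \mathbb{R}^d$ such that
\[ Q^\pi(s, a) = \langle \phi(s, a), w_\pi \rangle \quad \text{for all } (s, a), \]
where $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ is the given feature map. As the abstract notes, this $Q^\pi$-realizability condition strictly generalizes linear MDPs, where the transition kernel and reward are themselves assumed linear in $\phi$.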
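To illustrate why a constant feature dimension makes a CSC oracle tractable, the sketch below gives a minimal brute-force implementation in Python. Everything in it (the name csc_oracle, the grid discretization of the weight vector, and the greedy linear policy class) is an illustrative assumption on our part rather than the paper's construction: the point is only that enumerating a grid over weight vectors in $\mathbb{R}^d$ costs exponential-in-$d$ time, which is constant-degree polynomial whenever $d$ is a constant.

    import itertools
    import numpy as np

    def csc_oracle(features, costs, grid_points=5):
        # Illustrative brute-force cost-sensitive classification oracle.
        # features: (n, A, d) array of phi(s, a) for n samples and A actions.
        # costs:    (n, A) array; costs[i, a] is the cost of action a on sample i.
        # Returns the grid weight vector w whose greedy policy
        # a(s) = argmin_a <phi(s, a), w> attains the smallest empirical cost.
        n, _, d = features.shape
        axis = np.linspace(-1.0, 1.0, grid_points)
        best_w, best_cost = None, np.inf
        for w in itertools.product(axis, repeat=d):  # O(grid_points**d) candidates
            w = np.asarray(w)
            scores = features @ w                    # (n, A) score per action
            chosen = scores.argmin(axis=1)           # greedy action per sample
            total = costs[np.arange(n), chosen].sum()
            if total < best_cost:
                best_w, best_cost = w, total
        return best_w, best_cost

    # Example: 100 samples, 3 actions, d = 2 (constant), random instance.
    rng = np.random.default_rng(0)
    w_hat, cost = csc_oracle(rng.standard_normal((100, 3, 2)),
                             rng.standard_normal((100, 3)))

The exhaustive search over weight candidates is what makes the runtime scale as grid_points to the power $d$; any refinement (e.g., searching only over unit-norm weights) changes constants but not this exponential dependence on the feature dimension.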