We propose a new stochastic primal-dual optimization algorithm for planning in a large discounted Markov decision process with a generative model and linear function approximation. Assuming that the feature map approximately satisfies standard realizability and Bellman-closedness conditions, and that the feature vectors of all state-action pairs are representable as convex combinations of the feature vectors of a small core set of state-action pairs, we show that our method outputs a near-optimal policy after a polynomial number of queries to the generative model. Our method is computationally efficient and has the major advantage that it outputs a single softmax policy that is compactly represented by a low-dimensional parameter vector, and does not need to execute computationally expensive local planning subroutines at runtime.
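
To illustrate what a softmax policy compactly represented by a low-dimensional parameter vector can look like, the following minimal sketch assumes the common linear-softmax form, in which the probability of an action is proportional to the exponential of the inner product between its feature vector and the parameter vector. The abstract does not spell out the exact parameterization, so the function name `softmax_policy`, the feature values, and the parameter vector `theta` are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def softmax_policy(
    features: Sequence[Sequence[float]], theta: Sequence[float]
) -> List[float]:
    """Return action probabilities for a single state.

    features[a] is an assumed feature vector phi(s, a) for action a, and
    theta is the low-dimensional policy parameter vector (both hypothetical).
    """
    # Linear score of each action: <phi(s, a), theta>.
    scores = [sum(f * t for f, t in zip(phi, theta)) for phi in features]
    # Subtracting the maximum score avoids overflow and leaves the result unchanged.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Toy example with made-up numbers: 3 actions, 2-dimensional features.
phi_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [0.5, -0.5]
probs = softmax_policy(phi_s, theta)
```

Under this assumed form, evaluating the policy in a state costs one inner product per action, which is consistent with the claim that no local planning subroutine has to run at execution time.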