Modern reinforcement learning (RL) often faces an enormous state-action space. Existing analytical results typically cover settings with a small number of state-action pairs, or simple models such as linearly modeled Q-functions. To derive statistically efficient RL policies that handle large state-action spaces with more general Q-functions, some recent works have considered nonlinear function approximation using kernel ridge regression. In this work, we derive sample complexities for kernel-based Q-learning when a generative model is available. We propose a nonparametric Q-learning algorithm that finds an $\epsilon$-optimal policy in an arbitrarily large-scale discounted MDP. The sample complexity of the proposed algorithm is order-optimal with respect to $\epsilon$ and the complexity of the kernel (measured by its information gain). To the best of our knowledge, this is the first result showing a finite sample complexity under such a general model.
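To make the setting concrete, here is a minimal sketch of Q-function fitting with kernel ridge regression when a simulator can be queried at arbitrary state-action pairs. It is not the algorithm proposed in this work: it is a plain fitted Q-iteration on a single fixed batch of samples. The toy simulator `generative_model`, the squared-exponential kernel, and all hyperparameters (`lengthscale`, `ridge`, `gamma`, `n_samples`, `n_rounds`) are assumptions chosen purely for illustration.

```python
import numpy as np


def rbf_kernel(X, Y, lengthscale=0.5):
    """Squared-exponential kernel matrix between the rows of X and Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * lengthscale ** 2))


def generative_model(states, actions, rng):
    """Hypothetical toy simulator (an assumption, not from the paper).

    States live in [0, 1]; action 1 moves right, action 0 moves left,
    and the reward equals the next state.
    """
    drift = np.where(actions == 1, 0.1, -0.1)
    noise = 0.02 * rng.standard_normal(states.shape)
    next_states = np.clip(states + drift + noise, 0.0, 1.0)
    rewards = next_states.copy()
    return rewards, next_states


def q_values(states, Z, alpha, n_actions=2):
    """Evaluate the kernel regressor Q(s, a) for every action.

    Returns an array of shape (len(states), n_actions).
    """
    cols = []
    for a in range(n_actions):
        queries = np.column_stack([states, np.full(len(states), float(a))])
        cols.append(rbf_kernel(queries, Z) @ alpha)
    return np.column_stack(cols)


def kernel_q_iteration(n_samples=200, n_rounds=30, gamma=0.8, ridge=1e-2, seed=0):
    """Fitted Q-iteration with kernel ridge regression on one fixed batch."""
    rng = np.random.default_rng(seed)

    # Generative-model access: query the simulator at pairs of our choosing.
    states = rng.uniform(0.0, 1.0, size=n_samples)
    actions = rng.integers(0, 2, size=n_samples)
    rewards, next_states = generative_model(states, actions, rng)

    # Regression inputs are (state, action) pairs; the Gram matrix is fixed.
    Z = np.column_stack([states, actions.astype(float)])
    gram = rbf_kernel(Z, Z) + ridge * np.eye(n_samples)

    alpha = np.zeros(n_samples)  # corresponds to Q_0 = 0
    for _ in range(n_rounds):
        # Bellman targets built from the current Q estimate.
        targets = rewards + gamma * q_values(next_states, Z, alpha).max(axis=1)
        # Kernel ridge regression: solve (K + ridge * I) alpha = targets.
        alpha = np.linalg.solve(gram, targets)
    return Z, alpha


Z, alpha = kernel_q_iteration()
test_states = np.array([0.2, 0.5, 0.8])
Q = q_values(test_states, Z, alpha)
greedy_actions = Q.argmax(axis=1)
print(Q.round(2), greedy_actions)
```

The greedy policy is read off as the argmax over actions of the fitted Q-values. Reusing one batch keeps the Gram matrix fixed across rounds, which is a simplification made here for brevity; nothing in this sketch carries the sample complexity guarantees discussed above.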