Reinforcement learning (RL) has shown empirical success in various real-world settings with complex models and large state-action spaces. Existing analytical results, however, typically focus on settings with a small number of state-action pairs or on simple models such as linearly modeled state-action value functions. To derive RL policies that efficiently handle large state-action spaces with more general value functions, some recent works have considered nonlinear function approximation using kernel ridge regression. We propose $\pi$-KRVI, an optimistic modification of least-squares value iteration for the setting in which the state-action value function is represented by a reproducing kernel Hilbert space (RKHS). We prove the first order-optimal regret guarantees under a general setting. Our results show a significant improvement over the state of the art that is polynomial in the number of episodes. In particular, with highly non-smooth kernels (such as the Neural Tangent kernel or some Mat\'ern kernels), the existing results lead to trivial (superlinear in the number of episodes) regret bounds. We show a sublinear regret bound that is order-optimal in the case of Mat\'ern kernels, where a lower bound on regret is known.
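For readers unfamiliar with kernel ridge regression in optimistic value iteration, the following is a minimal sketch of the generic textbook construction, not necessarily the exact form used by $\pi$-KRVI. All notation here is assumed for illustration and does not appear in the abstract: $z=(s,a)$ is a state-action pair, $k$ is the kernel, $K_n$ is the $n \times n$ kernel matrix on past observations $z_1,\dots,z_n$, $k_n(z) = [k(z,z_1),\dots,k(z,z_n)]^\top$, $y_n$ is the vector of regression targets (for example, reward plus next-step value estimate), $\tau > 0$ is a regularization parameter, $\beta > 0$ is a confidence-width multiplier, $H$ is the episode length, and $h$ is the step index, with rewards assumed to lie in $[0,1]$.

$$
\begin{aligned}
\hat{f}_n(z) &= k_n(z)^\top \left(K_n + \tau^2 I\right)^{-1} y_n, \\
\sigma_n^2(z) &= k(z,z) - k_n(z)^\top \left(K_n + \tau^2 I\right)^{-1} k_n(z), \\
Q_h(z) &= \min\left\{ \hat{f}_n(z) + \beta\, \sigma_n(z),\; H - h + 1 \right\}.
\end{aligned}
$$

Under these assumptions, the first line is the kernel ridge predictor, the second quantifies its uncertainty at $z$, and the third is the optimistic (upper-confidence) value estimate, truncated at the largest value attainable from step $h$ onward.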