Reinforcement learning that uses kernel ridge regression to predict the expected value function is a powerful method with great representational capacity, and this setting provides a highly versatile framework amenable to analytical results. We consider kernel-based function approximation for RL in the infinite-horizon average-reward setting, also referred to as the undiscounted setting. We propose an optimistic algorithm, analogous to acquisition-function-based algorithms in the special case of bandits. We establish novel no-regret performance guarantees for our algorithm under kernel-based modelling assumptions. Additionally, we derive a novel confidence interval for the kernel-based prediction of the expected value function, applicable across various RL problems.
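The two ingredients named above, a kernel ridge regression prediction and a confidence interval around it, can be sketched as follows. This is a minimal illustration on a generic regression target, not the paper's construction: the RBF kernel, the regularizer `lam`, and the confidence scaling `beta` are all illustrative assumptions, and the uncertainty term is the standard GP-posterior-style width that optimistic kernel-based algorithms typically build their bonuses from.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel matrix between row sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

def krr_predict(X, y, X_star, lam=0.1, lengthscale=1.0, beta=2.0):
    """Kernel ridge regression mean plus a GP-style confidence band.

    lam (ridge regularizer) and beta (width multiplier) are
    hypothetical illustrative values, not the paper's choices.
    """
    n = len(X)
    K = rbf_kernel(X, X, lengthscale)
    Ks = rbf_kernel(X_star, X, lengthscale)
    # Mean prediction: k_*(K + lam I)^{-1} y
    alpha = np.linalg.solve(K + lam * np.eye(n), y)
    mean = Ks @ alpha
    # Posterior-style variance: k(x*,x*) - k_* (K + lam I)^{-1} k_*^T
    Kss = rbf_kernel(X_star, X_star, lengthscale)
    var = np.diag(Kss - Ks @ np.linalg.solve(K + lam * np.eye(n), Ks.T))
    width = beta * np.sqrt(np.maximum(var, 0.0))
    return mean, mean - width, mean + width

# Fit on noisy samples of a 1-D target function.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
X_star = np.linspace(0.0, 5.0, 50)[:, None]
mean, lo, hi = krr_predict(X, y, X_star)
```

An optimistic algorithm in this spirit would act greedily with respect to the upper bound `hi` rather than the mean, so that under-explored regions, where the width is large, are visited.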