To tackle long-planning-horizon problems in reinforcement learning with general function approximation, we propose UCRL-WVTR, the first algorithm to achieve a regret bound that is both \emph{horizon-free}, in that it eliminates the polynomial dependency on the planning horizon, and \emph{instance-dependent}. The derived regret bound is \emph{sharp}: when specialized to linear mixture MDPs, it matches the minimax lower bound up to logarithmic factors. Furthermore, UCRL-WVTR is \emph{computationally efficient} given access to a regression oracle. Achieving such a horizon-free, instance-dependent, and sharp regret bound hinges upon (i) novel algorithm designs: weighted value-targeted regression and a high-order moment estimator in the context of general function approximation; and (ii) fine-grained analyses: a novel concentration bound for weighted non-linear least squares and a refined analysis that leads to the tight instance-dependent bound. We also conduct comprehensive experiments to corroborate our theoretical findings.