Bootstrapping and rollout are two fundamental principles for value function estimation in reinforcement learning (RL). We introduce a novel class of Bellman operators, called subgraph Bellman operators, that interpolate between bootstrapping and rollout methods. Our estimator, obtained by solving for the fixed point of the empirical subgraph Bellman operator, combines the strengths of the bootstrapping-based temporal difference (TD) estimator and of rollout-based Monte Carlo (MC) methods. Specifically, the error upper bound of our estimator approaches the optimal variance achieved by TD, up to an additional term that depends on the exit probability of a selected subset of the state space. At the same time, the estimator exhibits the finite-sample adaptivity of MC, with a sample complexity that depends only on the occupancy measure of this subset. We complement the upper bound with an information-theoretic lower bound showing that the additional term is unavoidable given a reasonable sample size. Together, these results establish subgraph Bellman estimators as an optimal and adaptive framework for reconciling TD and MC methods in policy evaluation.
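To make the interpolation concrete, one plausible form of a subgraph Bellman operator, in our own illustrative notation (the subset $\mathcal{S}$, exit time $\tau$, discount $\gamma$, and reward $r$ are assumptions here, not the paper's formal definitions), rolls a trajectory out while it remains inside $\mathcal{S}$ and bootstraps with the current value estimate upon exit:
\[
(\mathcal{T}_{\mathcal{S}} V)(s) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{\tau-1} \gamma^{t} r(s_t) \;+\; \gamma^{\tau} V(s_{\tau}) \,\middle|\, s_0 = s\right], \qquad \tau \;=\; \min\{t \ge 1 : s_t \notin \mathcal{S}\}.
\]
Under this reading, an empty $\mathcal{S}$ forces $\tau = 1$ and recovers the one-step TD operator, while taking $\mathcal{S}$ to be the full state space lets the rollout run to termination and recovers MC; intermediate choices of $\mathcal{S}$ trade the exit-probability term in the error bound against the occupancy-measure dependence in the sample complexity.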