We study reinforcement learning (RL) with transition look-ahead, where the agent may observe which states would be visited upon playing any sequence of $\ell$ actions before deciding its course of action. While such predictive information can drastically improve the achievable performance, we show that using this information optimally comes at a potentially prohibitive computational cost. Specifically, we prove that optimal planning with one-step look-ahead ($\ell=1$) can be solved in polynomial time through a novel linear programming formulation, whereas for $\ell \geq 2$ the problem becomes NP-hard. Our results delineate a precise boundary between tractable and intractable cases for the problem of planning with transition look-ahead in reinforcement learning.