We study tabular reinforcement learning problems with multiple steps of lookahead information. Before acting, the learner observes $\ell$ steps of future transition and reward realizations: the exact state the agent would reach and the rewards it would collect under any possible course of action. While such information has been shown to drastically boost the achievable value, finding the optimal policy is NP-hard, and it is common to apply one of two tractable heuristics: processing the lookahead in chunks of predefined sizes ('fixed batching policies') or model predictive control. We first illustrate the shortcomings of these two approaches and propose utilizing the lookahead in adaptive (state-dependent) batches; we refer to such policies as adaptive batching policies (ABPs). We derive the optimal Bellman equations for these strategies and design an optimistic regret-minimizing algorithm that learns the optimal ABP while interacting with an unknown environment. Our regret bounds are order-optimal up to a potential factor of the lookahead horizon $\ell$, which can usually be regarded as a small constant.
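To make the notion of an ABP concrete, the following is a schematic sketch of the kind of Bellman recursion such policies induce; the notation is ours and is not necessarily the form derived in the paper. At step $h$ in state $s$, the policy commits to a batch size $k \le \ell$, observes the realized $k$-step lookahead, executes the best $k$-step action sequence under that realization, and recurses from the state it reaches:
\[
V_h(s) \;=\; \max_{k \in \{1,\dots,\ell\}} \; \mathbb{E}\!\left[\, \max_{a_{1:k}} \left( \sum_{t=1}^{k} r_t \;+\; V_{h+k}(s_k) \right) \right],
\]
where the expectation is over the lookahead realization (the sampled transitions and rewards under every possible course of action), $r_t$ is the realized reward of the $t$-th action in $a_{1:k}$, $s_k$ is the realized state after executing $a_{1:k}$, and the inner maximum is taken with these realizations known. In this view, fixed batching policies restrict $k$ to a predefined schedule, whereas an ABP may choose $k$ as a function of the current state.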