Online reinforcement learning in non-episodic, finite-horizon MDPs remains underexplored and is complicated by the need to estimate returns up to a fixed terminal time. Existing infinite-horizon methods, which often rely on the contraction induced by discounting, do not naturally account for this fixed-horizon structure. We introduce a modified Q-function: rather than targeting the full horizon, we learn a K-step lookahead Q-function that truncates planning to the next K steps. To further improve sample efficiency, we introduce a thresholding mechanism: an action is selected only when its estimated K-step lookahead value exceeds a time-varying threshold. We provide an efficient tabular learning algorithm for this objective and prove that it converges quickly in finite samples: it attains minimax-optimal constant regret for $K=1$ and $\mathcal{O}(\max((K-1),C_{K-1})\sqrt{SAT\log(T)})$ regret for any $K \geq 2$. We evaluate the algorithm numerically under the objective of maximizing reward. Our implementation adaptively increases K over time, balancing lookahead depth against estimation variance. Empirically, the algorithm attains higher cumulative reward than state-of-the-art tabular RL methods across synthetic MDPs and three RL environments: JumpRiverswim, FrozenLake, and AnyTrading. Code is available on \href{https://github.com/jamie01713/K-Step-Lookahead}{GitHub}.
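To make the K-step lookahead idea concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a known tabular model (a transition tensor `P` and a reward matrix `R`), whereas the paper learns online from samples. The function names `k_step_lookahead_q` and `thresholded_actions`, the toy MDP, and the scalar `threshold` argument are all hypothetical; the abstract does not specify the threshold schedule or the constant $C_{K-1}$.

```python
import numpy as np


def k_step_lookahead_q(P, R, K):
    """Truncated backward induction over the next K steps (K >= 1).

    P: transition tensor of shape (S, A, S); R: reward matrix of shape (S, A).
    Returns an (S, A) array: the immediate reward plus the best reward
    obtainable over the remaining K-1 steps. Rewards beyond K steps are ignored.
    """
    if K < 1:
        raise ValueError("K must be at least 1")
    V = np.zeros(P.shape[0])      # value of a 0-step lookahead
    Q = np.zeros_like(R, dtype=float)
    for _ in range(K):
        Q = R + P @ V             # (S, A, S) @ (S,) -> (S, A)
        V = Q.max(axis=1)         # greedy value for the shorter lookahead
    return Q


def thresholded_actions(Q, state, threshold):
    """Indices of actions whose lookahead value exceeds the threshold.

    In the paper the threshold varies over time; here it is a plain
    scalar supplied by the caller (an assumption for illustration).
    """
    return np.flatnonzero(Q[state] > threshold)


# Hypothetical 2-state, 2-action deterministic MDP:
# action 0 leads to state 0, action 1 leads to state 1;
# only (state 1, action 1) pays reward 1.
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0
P[:, 1, 1] = 1.0
R = np.array([[0.0, 0.0],
              [0.0, 1.0]])

Q1 = k_step_lookahead_q(P, R, K=1)   # myopic: equals R
Q2 = k_step_lookahead_q(P, R, K=2)   # one extra step of planning
```

With $K=1$ the lookahead is myopic and state 0 sees no difference between its actions; with $K=2$ the value of moving to state 1 becomes visible, which is the depth-versus-estimation trade-off that the adaptive choice of K is meant to balance.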