Online reinforcement learning in non-episodic, finite-horizon MDPs remains underexplored and is complicated by the need to estimate returns up to a fixed terminal time. Existing infinite-horizon methods, which typically rely on discounted contraction, do not naturally accommodate this fixed-horizon structure. We introduce a modified Q-function: rather than targeting the full-horizon return, we learn a K-step lookahead Q-function that truncates planning to the next K steps. To further improve sample efficiency, we introduce a thresholding mechanism: an action is selected only when its estimated K-step lookahead value exceeds a time-varying threshold. We provide an efficient tabular learning algorithm for this novel objective and prove fast finite-sample convergence: it attains minimax-optimal constant regret for $K=1$ and $\mathcal{O}(\max(K-1, C_{K-1})\sqrt{SAT\log(T)})$ regret for any $K \geq 2$. We numerically evaluate our algorithm on the objective of maximizing cumulative reward. Our implementation adaptively increases K over time, balancing lookahead depth against estimation variance. Empirical results demonstrate superior cumulative rewards over state-of-the-art tabular RL methods on synthetic MDPs and the JumpRiverswim, FrozenLake, and AnyTrading environments.
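To make the two ingredients of the abstract concrete, the following is a minimal illustrative sketch of a K-step lookahead Q-function with thresholded action selection on a toy tabular MDP. The MDP, the function names, and the threshold schedule here are hypothetical examples for intuition only, not the paper's algorithm (which learns the model online and carries the regret guarantees stated above); this sketch assumes a known model and computes the K-step lookahead values by finite-horizon backward induction.

```python
# Hypothetical toy MDP (not from the paper): 3 states, 2 actions.
S, A = 3, 2
# Deterministic transitions: action 0 stays put; action 1 moves to (s + 1) % S.
P = [[[1.0 if s2 == s else 0.0 for s2 in range(S)],
      [1.0 if s2 == (s + 1) % S else 0.0 for s2 in range(S)]]
     for s in range(S)]
# Reward 1 only for taking action 1 in state 2.
R = [[0.0, 1.0 if s == 2 else 0.0] for s in range(S)]

def k_step_lookahead_q(P, R, K):
    """K-step lookahead Q-values by backward induction (K >= 1):
    V_0 = 0,  Q_k(s,a) = R(s,a) + sum_{s'} P(s'|s,a) V_{k-1}(s'),
    V_k(s) = max_a Q_k(s,a).  Returns Q_K as a nested list."""
    V = [0.0] * S
    for _ in range(K):
        Q = [[R[s][a] + sum(P[s][a][s2] * V[s2] for s2 in range(S))
              for a in range(A)]
             for s in range(S)]
        V = [max(Q[s]) for s in range(S)]
    return Q

def select_action(Q, s, t, threshold_fn):
    """Thresholded selection: act greedily only if the best K-step
    lookahead value clears a time-varying threshold; else abstain (None)."""
    best = max(range(A), key=lambda a: Q[s][a])
    return best if Q[s][best] >= threshold_fn(t) else None
```

With K = 2, state 1 can reach the rewarding transition within the lookahead window, so `select_action` fires there but abstains in state 0, where no reward is visible in two steps. The online algorithm in the paper replaces the known `P` and `R` with estimates refined from data and, in the implementation, grows K over time.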