Reinforcement learning (RL) for reachability specifications is fundamental to sequential decision-making, yet its theoretical guarantees remain underexplored. Recent work establishes asymptotic convergence to optimal policies, but that approach offers limited insight into the convergence dynamics. In this work, we present an alternative approach that yields deeper theoretical insight into convergence. Our approach builds on probably approximately correct (PAC) learning under assumptions. PAC learning guarantees near-optimal policies with high confidence in finite time, but it requires knowledge of internal parameters of the Markov decision process (MDP), such as the minimum transition probability. We argue that although these parameters are unknown in RL, they can be estimated and iteratively refined with increasing accuracy. By satisfying the PAC conditions iteratively, we show that exact optimality is achieved in the limit. Empirical evaluations on standard benchmarks validate our theoretical insights into the convergence dynamics.
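To make the idea of iteratively satisfying PAC conditions concrete, the following is a minimal sketch and not the algorithm of this work. It assumes a hypothetical schedule in which the guess for the minimum transition probability, the accuracy, and the failure probability are all halved each round, and it uses two standard bounds to size the per-round sample budget: Hoeffding's inequality for estimating a probability, and the bound (1 - p)^n <= delta for observing a transition of probability at least p. All function names, parameter names, and the halving schedule are illustrative assumptions.

```python
import math


def hoeffding_samples(epsilon: float, delta: float) -> int:
    """Samples so that an empirical mean of [0, 1] values is within
    epsilon of its expectation with probability at least 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def support_samples(p_min_guess: float, delta: float) -> int:
    """Samples so that a transition with probability >= p_min_guess is
    observed at least once with probability at least 1 - delta,
    from (1 - p)^n <= delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - p_min_guess))


def refinement_schedule(rounds: int,
                        p_min_guess: float = 0.5,
                        epsilon: float = 0.5,
                        delta: float = 0.5) -> list:
    """Hypothetical schedule: tighten every assumed parameter each round."""
    schedule = []
    for k in range(rounds):
        schedule.append({
            "round": k,
            "p_min_guess": p_min_guess,
            "epsilon": epsilon,
            "delta": delta,
            # Per state-action pair: enough samples for both bounds.
            "samples_per_pair": max(hoeffding_samples(epsilon, delta),
                                    support_samples(p_min_guess, delta)),
        })
        # Halving is an illustrative choice, not taken from the source.
        p_min_guess /= 2.0
        epsilon /= 2.0
        delta /= 2.0
    return schedule


if __name__ == "__main__":
    for row in refinement_schedule(5):
        print(row)
```

Under this assumed schedule the sample budget grows from round to round while the guessed minimum transition probability shrinks toward zero, so any true positive minimum probability is eventually undercut by the guess. This is the intuition behind obtaining optimality in the limit from a sequence of finite-time PAC guarantees.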