Existing metrics for reinforcement learning (RL), such as regret, PAC bounds, or uniform-PAC (Dann et al., 2017), typically evaluate cumulative performance while allowing the agent to play an arbitrarily bad policy at any finite time t. Such behavior can be highly detrimental in high-stakes applications. This paper introduces a stronger metric, the uniform last-iterate (ULI) guarantee, which captures both the cumulative and the instantaneous performance of RL algorithms. Specifically, ULI characterizes instantaneous performance by requiring that the per-round suboptimality of the played policy be bounded by a function that is monotonically decreasing in the round t, thereby preventing the agent from revisiting bad policies once sufficient samples are available. We demonstrate that a near-optimal ULI guarantee directly implies near-optimal cumulative performance under all of the aforementioned metrics, but not vice versa. To examine the achievability of ULI, we first provide two positive results for bandit problems with finitely many arms, showing that elimination-based algorithms, as well as high-probability adversarial algorithms equipped with a stronger analysis or additional design elements, can attain near-optimal ULI guarantees. We also provide a negative result, showing that optimistic algorithms cannot achieve a near-optimal ULI guarantee. Furthermore, we propose an efficient algorithm for linear bandits with infinitely many arms that achieves the ULI guarantee, given access to an optimization oracle. Finally, we propose an algorithm that achieves a near-optimal ULI guarantee in the online reinforcement learning setting.
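To make the instantaneous requirement concrete, the following is a minimal formal sketch of the ULI property, written in notation of our own choosing: $\pi_t$ for the policy played at round $t$, $\Delta_t$ for its suboptimality, $\delta$ for the failure probability, and $F$ for the bounding function (these symbols are illustrative and need not match the paper's). An algorithm satisfies a ULI guarantee with rate $F$ if, with probability at least $1 - \delta$,

\[
\Delta_t \;\le\; F(\delta, t) \qquad \text{simultaneously for all rounds } t \ge 1,
\]

where $F(\delta, \cdot)$ is monotonically decreasing in $t$. Because $\sum_{t=1}^{T} \Delta_t \le \sum_{t=1}^{T} F(\delta, t)$, a near-optimal ULI bound immediately yields a near-optimal cumulative regret bound, whereas a regret bound alone leaves room for isolated rounds with large $\Delta_t$.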