Batch reinforcement learning (RL) defines the task of learning from a fixed batch of data lacking exhaustive exploration. Worst-case optimality algorithms, which calibrate a value-function model class from logged experience and perform some type of pessimistic evaluation under the learned model, have emerged as a promising paradigm for batch RL. However, contemporary work in this line has commonly overlooked the hierarchical decision-making structure hidden in the optimization landscape. In this paper, we adopt a game-theoretic viewpoint and model the policy learning framework as a two-player general-sum game with a leader-follower structure. We propose a novel stochastic gradient-based learning algorithm, StackelbergLearner, in which the leader player updates according to the total derivative of its objective instead of the usual individual gradient, and the follower player makes individual updates and ensures transition-consistent pessimistic reasoning. The resulting learning dynamics naturally lend StackelbergLearner a game-theoretic interpretation and guarantee convergence to differentiable Stackelberg equilibria. From a theoretical standpoint, we provide instance-dependent regret bounds with general function approximation, which show that our algorithm can learn a best-effort policy that competes against any comparator policy covered by the batch data. Notably, our regret guarantees require only realizability, without any data coverage or strong function approximation conditions (e.g., Bellman closedness), in contrast to prior works lacking such guarantees. Through comprehensive experiments, we find that our algorithm consistently performs as well as or better than state-of-the-art methods on batch RL benchmarks and real-world datasets.
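To illustrate the difference between a total-derivative leader update and the usual individual gradient, the following is a minimal sketch on a hypothetical two-player quadratic game. It is not the StackelbergLearner algorithm itself: the losses `f1` and `f2`, the step sizes, and the function names are all assumptions chosen so the result can be checked by hand, and the updates are deterministic rather than stochastic. The follower's best response here is y*(x) = x, so by the implicit function theorem the leader's total derivative is ∂f1/∂x + (dy*/dx) ∂f1/∂y with dy*/dx = −(∂²f2/∂y∂x) / (∂²f2/∂y²).

```python
# Toy leader-follower game (hypothetical, for illustration only):
#   leader loss    f1(x, y) = 0.5 * (x - 1)**2 + 0.5 * y**2
#   follower loss  f2(x, y) = 0.5 * (y - x)**2

def leader_partials(x, y):
    """Return (df1/dx, df1/dy) for the toy leader loss."""
    return x - 1.0, y

def follower_partials(x, y):
    """Return (df2/dy, d2f2/dy2, d2f2/dydx) for the toy follower loss."""
    return y - x, 1.0, -1.0

def run(use_total_derivative, steps=2000, lr_leader=0.05, lr_follower=0.5):
    """Simultaneous gradient play; the leader uses either the total
    derivative of f1 or only its individual partial derivative."""
    x, y = 0.0, 0.0
    for _ in range(steps):
        f1_x, f1_y = leader_partials(x, y)
        f2_y, f2_yy, f2_yx = follower_partials(x, y)
        if use_total_derivative:
            # Implicit function theorem: sensitivity of the follower's
            # best response to the leader's action.
            dy_dx = -f2_yx / f2_yy
            leader_grad = f1_x + dy_dx * f1_y
        else:
            leader_grad = f1_x
        # The follower always takes an individual gradient step.
        x, y = x - lr_leader * leader_grad, y - lr_follower * f2_y
    return x, y

stackelberg_point = run(use_total_derivative=True)   # approaches (0.5, 0.5)
individual_point = run(use_total_derivative=False)   # approaches (1.0, 1.0)
```

In this toy game the total-derivative dynamics settle at the Stackelberg solution (the minimizer of f1 along the follower's best-response curve y = x), whereas individual gradients for both players settle at a different stationary point. The leader also uses a smaller step size than the follower, a common choice in leader-follower dynamics that is assumed here rather than taken from the paper.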