Pessimism, used when reasoning about datasets that lack exhaustive exploration, has recently gained prominence in offline reinforcement learning. Although it makes algorithms more robust, overly pessimistic reasoning can be equally damaging because it precludes the discovery of good policies, which is an issue for the popular bonus-based pessimism. In this paper, we introduce the notion of Bellman-consistent pessimism for general function approximation: instead of calculating a point-wise lower bound for the value function, we implement pessimism at the initial state over the set of functions consistent with the Bellman equations. Our theoretical guarantees only require Bellman closedness, as is standard in the exploratory setting, in which case bonus-based pessimism fails to provide guarantees. Even in the special case of linear function approximation where stronger expressivity assumptions hold, our result improves upon a recent bonus-based approach by a factor of $\mathcal{O}(d)$ in its sample complexity when the action space is finite. Remarkably, our algorithms automatically adapt to the best bias-variance tradeoff in hindsight, whereas most prior approaches require tuning extra hyperparameters a priori.