We study offline reinforcement learning (RL), which seeks to learn a good policy from a fixed, pre-collected dataset. A fundamental challenge in this setting is distributional shift, which arises because the dataset lacks sufficient exploration and is especially pronounced under function approximation. To tackle this issue, we propose a bi-level structured policy optimization algorithm that models a hierarchical interaction between the policy (upper level) and the value function (lower level). The lower level constructs a confidence set of value estimates that maintain sufficiently small weighted average Bellman errors while controlling the uncertainty arising from distribution mismatch. The upper level then selects the policy that maximizes a conservative value estimate drawn from the confidence set formed at the lower level. This novel formulation preserves the maximum flexibility of the implicitly induced exploratory data distribution, which allows the method to exploit model extrapolation. In practice, it can be solved through a computationally efficient penalized adversarial estimation procedure. Our theoretical regret guarantees do not rely on any data-coverage or completeness-type assumptions and require only realizability. These guarantees also show that the learned policy represents the "best effort" among all policies, as no other policy can outperform it. We evaluate our model on a blend of synthetic, benchmark, and real-world datasets for offline RL and show that it performs competitively with state-of-the-art methods.
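The bi-level structure described in the abstract can be written schematically as follows. This is only a sketch in notation assumed here, not taken from the source: $\Pi$ is a policy class, $\mathcal{Q}$ a value-function class, $\mathcal{W}$ a class of weight functions, $\mu_0$ the initial-state distribution, $\gamma$ the discount factor, $\varepsilon_n$ a tolerance that shrinks with the sample size $n$, $(s_i, a_i, r_i, s_i')$ the logged transitions, and $Q(s,\pi) := \sum_a \pi(a \mid s)\, Q(s,a)$. The exact objective, constraint, and tolerance used by the authors may differ.

```latex
% Upper level: choose the policy with the best conservative (worst-case) value.
\hat{\pi} \in \arg\max_{\pi \in \Pi} \; \min_{Q \in \Omega_n(\pi)} \;
  \mathbb{E}_{s_0 \sim \mu_0}\bigl[ Q(s_0, \pi) \bigr]

% Lower level: confidence set of value estimates whose weighted average
% Bellman error on the offline data is small for every weight function.
\Omega_n(\pi) = \Bigl\{ Q \in \mathcal{Q} \;:\;
  \sup_{w \in \mathcal{W}} \Bigl| \frac{1}{n} \sum_{i=1}^{n} w(s_i, a_i)
  \bigl( r_i + \gamma\, Q(s_i', \pi) - Q(s_i, a_i) \bigr) \Bigr|
  \le \varepsilon_n \Bigr\}
```

Under this reading, the inner minimum over $\Omega_n(\pi)$ supplies the conservative value estimate, and the supremum over $w$ plays the role of the adversary that a penalized adversarial estimation procedure would optimize against.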