The main challenge of offline reinforcement learning, where data is limited, arises from a sequence of counterfactual reasoning dilemmas over the space of possible actions: what if we had chosen a different course of action? Answering such questions requires extrapolating beyond the data, and the resulting extrapolation errors tend to accumulate exponentially with the problem horizon. It is therefore crucial to recognize that not all decision steps are equally important to the final outcome, and to budget the number of counterfactual decisions a policy makes in order to control extrapolation. In contrast to existing approaches that regularize either the policy or the value function, we propose an approach that explicitly bounds the number of out-of-distribution actions during training. Specifically, our method uses dynamic programming to decide where to extrapolate and where not to, subject to an upper bound on the number of decisions that differ from the behavior policy. It thereby balances the potential improvement from taking out-of-distribution actions against the risk of extrapolation errors. Theoretically, we justify our method by the constrained optimality of the fixed-point solution to our $Q$ updating rules. Empirically, we show that the overall performance of our method is better than that of state-of-the-art offline RL methods on tasks in the widely used D4RL benchmarks.
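The idea of budgeting deviations from the behavior policy with dynamic programming can be illustrated with a small tabular sketch. This is not the authors' update rule, which the abstract does not spell out; it is one plausible reading under stated assumptions: a deterministic toy MDP, a known deterministic behavior policy, and a value table indexed by the remaining budget. All names (`budgeted_q_iteration`, `P`, `R`, `behavior`, `max_budget`) are hypothetical and introduced only for illustration.

```python
import numpy as np


def budgeted_q_iteration(P, R, behavior, max_budget, gamma=0.9, n_iters=200):
    """Illustrative budgeted Q-iteration (an assumption, not the paper's rule).

    P[s, a]     -> next state (deterministic transitions, assumed)
    R[s, a]     -> immediate reward
    behavior[s] -> action the behavior policy takes in state s
    Returns Q[b, s, a]: value of taking a in s with b deviations remaining.
    """
    n_states, n_actions = R.shape
    # V[b, s]: best return from s when at most b deviations remain.
    V = np.zeros((max_budget + 1, n_states))
    Q = np.zeros((max_budget + 1, n_states, n_actions))
    for _ in range(n_iters):
        for b in range(max_budget + 1):
            for s in range(n_states):
                for a in range(n_actions):
                    deviates = a != behavior[s]
                    if deviates and b == 0:
                        # No budget left: deviating actions are forbidden.
                        Q[b, s, a] = -np.inf
                        continue
                    # A deviation spends one unit of budget; following
                    # the behavior policy is free.
                    b_next = b - 1 if deviates else b
                    Q[b, s, a] = R[s, a] + gamma * V[b_next, P[s, a]]
        # The behavior action is always allowed, so the max is finite.
        V = Q.max(axis=2)
    return Q


# Toy 3-state chain: action 0 stays, action 1 moves right; state 2 absorbs.
P = np.array([[0, 1], [1, 2], [2, 2]])
R = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
behavior = np.array([0, 0, 0])  # the behavior policy always stays

Q = budgeted_q_iteration(P, R, behavior, max_budget=2)
values = Q.max(axis=2)      # values[b, s]
greedy = Q.argmax(axis=2)   # greedy[b, s]
```

In this toy chain, a budget of zero reproduces the behavior policy exactly, one deviation is only worth spending in state 1 (where it collects the reward), and two deviations are needed to profit from state 0. The value is non-decreasing in the budget, which reflects the trade-off described above: each extra deviation can add return, but in a learned setting each one also introduces a step of extrapolation.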