In offline reinforcement learning, out-of-distribution (OOD) actions pose a pronounced challenge. To address this, existing methods often constrain the learned policy through policy regularization. However, these methods frequently suffer from unnecessary conservativeness that hampers policy improvement, because they indiscriminately use all actions from the behavior policy that generated the offline dataset as constraints. The problem becomes particularly noticeable when the dataset quality is suboptimal. We therefore propose Adaptive Advantage-guided Policy Regularization (A2PR), which obtains high-advantage actions from an augmented behavior policy combined with a variational autoencoder (VAE) to guide the learned policy. A2PR can select high-advantage actions that differ from those present in the dataset while still effectively maintaining conservatism toward OOD actions, by harnessing the VAE's capacity to generate samples that match the distribution of the data points. We theoretically prove that A2PR guarantees improvement over the behavior policy, and that it effectively mitigates value overestimation with a bounded performance gap. Empirically, we conduct a series of experiments on the D4RL benchmark, where A2PR demonstrates state-of-the-art performance. Furthermore, experimental results on additional suboptimal mixed datasets show that A2PR remains superior. Code is available at https://github.com/ltlhuuu/A2PR.
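To make the guidance mechanism concrete, the sketch below illustrates one plausible reading of the selection step: for each state, the regularization target is whichever candidate action, from the dataset or from the VAE-augmented behavior policy, has the higher advantage A(s, a) = Q(s, a) - V(s). This is a minimal illustration, not the paper's implementation; `q_fn`, `v_fn`, and the pre-sampled `vae_actions` are stand-in stubs introduced here for exposition.

```python
import numpy as np

def q_fn(states, actions):
    # Stub critic Q(s, a): peaks when every action dimension is 0.5.
    return -((actions - 0.5) ** 2).sum(axis=-1)

def v_fn(states):
    # Stub value baseline V(s): zero for simplicity.
    return np.zeros(len(states))

def select_guidance_actions(states, dataset_actions, vae_actions):
    """Per state, pick the candidate action (dataset vs. VAE-generated)
    with the higher advantage A(s, a) = Q(s, a) - V(s); the learned
    policy is then regularized toward the selected action."""
    adv_data = q_fn(states, dataset_actions) - v_fn(states)
    adv_vae = q_fn(states, vae_actions) - v_fn(states)
    use_vae = (adv_vae > adv_data)[:, None]  # broadcast over action dims
    return np.where(use_vae, vae_actions, dataset_actions)

rng = np.random.default_rng(0)
states = rng.normal(size=(4, 3))
dataset_actions = rng.uniform(0.0, 1.0, size=(4, 2))
vae_actions = rng.uniform(0.0, 1.0, size=(4, 2))  # stand-in for VAE samples
targets = select_guidance_actions(states, dataset_actions, vae_actions)
print(targets.shape)  # one regularization target per state
```

Because the VAE is trained to match the dataset distribution, its samples stay in-distribution, so guiding toward the higher-advantage candidate improves the regularization target without inviting OOD actions.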