We propose a novel model-based offline Reinforcement Learning (RL) framework, called Adversarial Model for Offline Reinforcement Learning (ARMOR), which can robustly learn policies that improve upon an arbitrary reference policy regardless of data coverage. ARMOR optimizes policies for their worst-case performance relative to the reference policy by adversarially training a Markov decision process model. In theory, we prove that ARMOR, with a well-tuned hyperparameter, can compete with the best policy within data coverage when the reference policy is supported by the data. At the same time, ARMOR is robust to hyperparameter choices: the policy learned by ARMOR with any admissible hyperparameter never performs worse than the reference policy, even when the reference policy is not covered by the dataset. To validate these properties in practice, we design a scalable implementation of ARMOR which, through adversarial training, optimizes policies without the model ensembles that typical model-based methods rely on. We show that ARMOR is competitive with state-of-the-art offline model-free and model-based RL algorithms and robustly improves the reference policy over a range of hyperparameter choices.
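The worst-case relative objective described in the abstract can be illustrated with a toy sketch. This is not the ARMOR implementation: it assumes a small finite set of candidate models that stands in for the models consistent with the data, and it enumerates them by brute force instead of training a model adversarially. The function name `relative_pessimism_policy` and all return values are hypothetical. The point it shows is why the reference policy is never degraded: the reference policy has relative performance zero under every model, so the max-min solution has worst-case relative performance of at least zero.

```python
import numpy as np

def relative_pessimism_policy(returns, ref_idx):
    """Pick the policy with the best worst-case return relative to the reference.

    returns: array of shape (n_models, n_policies); returns[m, p] is the
             return of policy p under candidate model m (hypothetical numbers).
    ref_idx: column index of the reference policy.
    """
    returns = np.asarray(returns, dtype=float)
    # Relative performance J_M(pi) - J_M(pi_ref) for every model and policy.
    relative = returns - returns[:, [ref_idx]]
    # The adversary picks, for each policy, the model that makes it look worst.
    worst_case = relative.min(axis=0)
    # The learner picks the policy whose worst-case relative return is largest.
    best = int(worst_case.argmax())
    return best, worst_case

# Three candidate models (rows) and three policies (columns).
# Column 0 is the reference policy; all values are made up for illustration.
returns = [
    [1.0, 1.5, 3.0],
    [1.0, 1.2, 0.2],
    [0.8, 1.1, 0.5],
]
best, worst_case = relative_pessimism_policy(returns, ref_idx=0)
print("selected policy:", best)
print("worst-case relative returns:", worst_case)
```

In this toy setting, policy 2 looks best under the first model but is much worse than the reference under the other two, so the relative worst-case criterion passes over it in favor of policy 1, which improves on the reference under every candidate model.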