As with any machine learning problem with limited data, effective offline RL algorithms require careful regularization to avoid overfitting. One-step methods perform regularization by doing just a single step of policy improvement, while critic regularization methods do many steps of policy improvement with a regularized objective. These methods appear distinct. One-step methods, such as advantage-weighted regression and conditional behavioral cloning, truncate policy iteration after just one step. This "early stopping" makes one-step RL simple and stable, but can limit its asymptotic performance. Critic regularization typically requires more compute but has appealing lower-bound guarantees. In this paper, we draw a close connection between these methods: applying a multi-step critic regularization method with a regularization coefficient of 1 yields the same policy as one-step RL. While practical implementations violate our assumptions and critic regularization is typically applied with smaller regularization coefficients, our experiments nevertheless show that our analysis makes accurate, testable predictions about practical offline RL methods (CQL and one-step RL) with commonly-used hyperparameters. Our results do not imply that every problem can be solved with a single step of policy improvement, but rather that one-step RL might be competitive with critic regularization on RL problems that demand strong regularization.