While imitation learning requires access to high-quality data, offline reinforcement learning (RL) should, in principle, perform comparably or better with substantially lower-quality data by using a value function. However, current results indicate that offline RL often performs worse than imitation learning, and it is often unclear what holds back the performance of offline RL. Motivated by this observation, we aim to understand the bottlenecks in current offline RL algorithms. While poor performance of offline RL is typically attributed to an imperfect value function, we ask: is the main bottleneck of offline RL indeed in learning the value function, or something else? To answer this question, we perform a systematic empirical study of (1) value learning, (2) policy extraction, and (3) policy generalization in offline RL problems, analyzing how these components affect performance. We make two surprising observations. First, we find that the choice of a policy extraction algorithm significantly affects the performance and scalability of offline RL, often more so than the value learning objective. For instance, we show that common value-weighted behavioral cloning objectives (e.g., AWR) do not fully leverage the learned value function, and switching to behavior-constrained policy gradient objectives (e.g., DDPG+BC) often leads to substantial improvements in performance and scalability. Second, we find that a major barrier to improving offline RL performance is often imperfect policy generalization on test-time states outside the support of the training data, rather than policy learning on in-distribution states. We then show that the use of suboptimal but high-coverage data or test-time policy training techniques can address this generalization issue in practice. Specifically, we propose two simple test-time policy improvement methods and show that these methods lead to better performance.
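To make the contrast between the two policy extraction families concrete, here is a minimal NumPy sketch of the standard forms of both objectives. This is an illustration of the general techniques the abstract names, not the paper's exact implementation; the hyperparameters `beta` (AWR temperature), `w_max` (weight clipping), and `alpha` (BC coefficient) are assumed values for the example.

```python
import numpy as np

def awr_loss(log_probs, advantages, beta=1.0, w_max=20.0):
    """Advantage-weighted regression (AWR): value-weighted behavioral cloning.
    The policy only re-weights log-likelihoods of dataset actions by
    exp(A / beta), so the value function enters solely through scalar
    weights; the critic is never queried at the policy's own actions.
    `beta` and `w_max` are illustrative hyperparameter choices."""
    weights = np.minimum(np.exp(advantages / beta), w_max)  # clipped exponential weights
    return -np.mean(weights * log_probs)

def ddpg_bc_loss(q_at_pi_actions, pi_actions, data_actions, alpha=0.1):
    """DDPG+BC: behavior-constrained deterministic policy gradient.
    Maximizes Q evaluated at the policy's own actions (a first-order use
    of the learned value function), plus a squared-error BC penalty that
    keeps the policy near the dataset actions. `alpha` trades off the
    two terms and is an assumed value here."""
    bc_penalty = np.mean(np.sum((pi_actions - data_actions) ** 2, axis=-1))
    return -np.mean(q_at_pi_actions) + alpha * bc_penalty
```

The key structural difference: AWR's gradient never moves actions beyond re-weighted cloning of the data, whereas DDPG+BC differentiates through the critic, which is one way to interpret the abstract's claim that AWR-style objectives do not fully leverage the learned value function.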