Offline reinforcement learning (RL) learns effective policies from a static target dataset. Although state-of-the-art (SOTA) offline RL algorithms are promising, they rely heavily on the quality of the target dataset, and their performance can degrade when the target dataset contains only a limited number of samples, as is often the case in real-world applications. To address this issue, domain adaptation that leverages auxiliary samples from related source datasets (such as simulators) can be beneficial. In this context, determining how best to trade off the source and target datasets remains a critical challenge in offline RL. To the best of our knowledge, this paper proposes the first framework that theoretically and experimentally explores how the weight assigned to each dataset affects the performance of offline RL. We establish performance bounds and a convergence neighborhood for our framework, both of which depend on the choice of weight. Furthermore, we identify the existence of an optimal weight for balancing the two datasets. All of the theoretical guarantees and the optimal weight depend on the quality of the source dataset and the size of the target dataset. Our empirical results on the well-known Procgen Benchmark substantiate our theoretical contributions.