The prevalent use of benchmarks in current offline reinforcement learning (RL) research has led model development to neglect the imbalanced distributions of real-world datasets. Real-world offline RL datasets are often imbalanced over the state space, either because exploration is difficult or because of safety considerations. In this paper, we specify the properties of imbalanced datasets in offline RL, where state coverage follows a power-law distribution characterized by skewed policies. We show theoretically and empirically that typical offline RL methods based on distributional constraints, such as conservative Q-learning (CQL), are ineffective at extracting policies from imbalanced datasets. Inspired by natural intelligence, we propose a novel offline RL method that augments CQL with a retrieval process to recall related past experiences, effectively alleviating the challenges posed by imbalanced datasets. We evaluate our method on several tasks with imbalanced datasets of varying degrees of imbalance, constructed from variants of D4RL. Empirical results show that our method outperforms the other baselines.
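To make "state coverage follows a power-law distribution" concrete, the sketch below subsamples a dataset so that state bin k keeps a share of transitions proportional to (k + 1)^(-alpha). It is only an illustration under stated assumptions: states are assumed to be already discretized into bins by some function `bin_of`, and the helpers `power_law_weights` and `imbalanced_subsample` are hypothetical names. This is not the paper's D4RL-based construction, which the abstract does not describe.

```python
import random
from collections import Counter, defaultdict


def power_law_weights(num_bins, alpha):
    """Normalized Zipf-style weights: w_k is proportional to (k + 1) ** (-alpha)."""
    raw = [(k + 1) ** (-alpha) for k in range(num_bins)]
    total = sum(raw)
    return [w / total for w in raw]


def imbalanced_subsample(transitions, bin_of, num_bins, alpha, budget, seed=0):
    """Keep about budget * w_k transitions from bin k, sampled without replacement.

    bin_of is a hypothetical user-supplied function that maps a transition
    to a state-bin index in [0, num_bins).
    """
    groups = defaultdict(list)
    for t in transitions:
        groups[bin_of(t)].append(t)
    rng = random.Random(seed)
    subset = []
    for k, w in enumerate(power_law_weights(num_bins, alpha)):
        pool = groups.get(k, [])
        take = min(len(pool), round(w * budget))  # never exceed what the bin holds
        subset.extend(rng.sample(pool, take))
    return subset


if __name__ == "__main__":
    # Toy data: 1000 integer "transitions", spread evenly over 10 bins.
    data = list(range(1000))
    sub = imbalanced_subsample(data, lambda x: x % 10, num_bins=10, alpha=1.0, budget=200)
    print(sorted(Counter(x % 10 for x in sub).items()))
```

With alpha = 1.0, the most frequent bin keeps roughly ten times as many transitions as the rarest one. Larger alpha gives a more skewed dataset, so alpha serves as a single knob for the "varying levels of imbalance" mentioned above.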