Traditional offline reinforcement learning methods predominantly operate in a batch-constrained setting. This confines the algorithm to the specific state-action distribution present in the dataset, mitigating the effects of distributional shift but greatly restricting the algorithm. In this paper, we alleviate this limitation by introducing a novel framework named \emph{state-constrained} offline reinforcement learning. By exclusively focusing on the dataset's state distribution, our framework significantly enhances learning potential and relaxes these prior restrictions. The proposed setting not only broadens the learning horizon but also improves the ability to effectively combine different trajectories from the dataset, a desirable property in offline reinforcement learning. Our research is underpinned by theoretical findings that pave the way for subsequent advancements in this domain. Additionally, we introduce StaCQ, a deep learning algorithm that achieves strong performance on the D4RL benchmark datasets while closely aligning with our theoretical propositions. StaCQ establishes a strong baseline for forthcoming explorations in state-constrained offline reinforcement learning.