Most existing offline RL methods presume the availability of action labels within the dataset, but in many practical scenarios, actions may be missing due to privacy, storage, or sensor limitations. We formalise the setting of action-free offline-to-online RL, where agents must learn from datasets consisting solely of $(s,r,s')$ tuples and later leverage this knowledge during online interaction. To address this challenge, we propose learning state policies that recommend desirable next-state transitions rather than actions. Our contributions are twofold. First, we introduce a simple yet novel state discretisation transformation and propose Offline State-Only DecQN (\algo), a value-based algorithm designed to pre-train state policies from action-free data. \algo{} integrates the transformation to scale efficiently to high-dimensional problems while avoiding instability and overfitting associated with continuous state prediction. Second, we propose a novel mechanism for guided online learning that leverages these pre-trained state policies to accelerate the learning of online agents. Together, these components establish a scalable and practical framework for leveraging action-free datasets to accelerate online RL. Empirical results across diverse benchmarks demonstrate that our approach improves convergence speed and asymptotic performance, while analyses reveal that discretisation and regularisation are critical to its effectiveness.