Self-supervised methods have become crucial for advancing deep learning because they leverage the data itself to reduce the need for expensive annotations. However, how to conduct self-supervised offline reinforcement learning (RL) in a principled way remains unclear. In this paper, we address this issue by investigating the theoretical benefits of utilizing reward-free data in linear Markov Decision Processes (MDPs) within a semi-supervised setting. Further, we propose a novel algorithm, Provable Data Sharing (PDS), to utilize such reward-free data for offline RL. PDS adds penalties to the reward function learned from labeled data to prevent overestimation, which keeps the algorithm conservative. Our results on various offline RL tasks demonstrate that PDS significantly improves the performance of offline RL algorithms when reward-free data is available. Overall, our work provides a promising approach to leveraging unlabeled data in offline RL while maintaining theoretical guarantees. We believe our findings will contribute to the development of more robust self-supervised RL methods.
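To make the idea of a penalized, conservative reward concrete, the following is a minimal sketch in the linear setting. It is not the paper's implementation: the abstract does not specify the form of the penalty, so the ridge-regression reward model, the elliptical uncertainty term, and the names `fit_reward_ridge`, `pessimistic_rewards`, `lam`, and `beta` are all illustrative assumptions. The sketch fits a linear reward on labeled features, then labels reward-free features with the prediction minus an uncertainty penalty.

```python
import numpy as np


def fit_reward_ridge(phi_labeled, rewards, lam=1.0):
    """Ridge regression of rewards on features (assumed linear reward model)."""
    d = phi_labeled.shape[1]
    # Regularized feature covariance of the labeled data.
    cov = phi_labeled.T @ phi_labeled + lam * np.eye(d)
    theta = np.linalg.solve(cov, phi_labeled.T @ rewards)
    return theta, cov


def pessimistic_rewards(phi_unlabeled, theta, cov, beta=1.0):
    """Label reward-free features with a penalized (conservative) reward.

    The penalty beta * sqrt(phi^T cov^{-1} phi) is an assumed form, chosen
    because it grows in directions poorly covered by the labeled data.
    """
    cov_inv = np.linalg.inv(cov)
    # Per-sample quadratic form phi_n^T cov^{-1} phi_n.
    quad = np.einsum("nd,de,ne->n", phi_unlabeled, cov_inv, phi_unlabeled)
    return phi_unlabeled @ theta - beta * np.sqrt(quad)


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
true_theta = np.array([1.0, -0.5, 0.25])
phi_l = rng.normal(size=(200, 3))                      # labeled features
r_l = phi_l @ true_theta + 0.1 * rng.normal(size=200)  # noisy rewards
phi_u = rng.normal(size=(50, 3))                       # reward-free features

theta, cov = fit_reward_ridge(phi_l, r_l)
r_pess = pessimistic_rewards(phi_u, theta, cov, beta=1.0)
```

Under these assumptions, the penalized labels never exceed the plain regression predictions, and setting `beta=0` recovers the unpenalized labels. The relabeled reward-free transitions could then be merged with the labeled dataset and passed to an offline RL algorithm.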