Offline reinforcement learning (RL) offers an appealing approach to real-world tasks by learning policies from pre-collected datasets without interacting with the environment. However, the performance of existing offline RL algorithms depends heavily on the scale and state-action space coverage of the datasets. Real-world data collection is often expensive and uncontrollable, which leads to small, narrowly covered datasets and poses significant challenges for practical deployment of offline RL. In this paper, we provide a new insight: leveraging the fundamental symmetry of system dynamics can substantially enhance offline RL performance on small datasets. Specifically, we propose a Time-reversal symmetry (T-symmetry) enforced Dynamics Model (TDM), which establishes consistency between a pair of forward and reverse latent dynamics. TDM provides both well-behaved representations for small datasets and a new reliability measure for out-of-distribution (OOD) samples based on their compliance with T-symmetry. These can be readily used to construct a new offline RL algorithm (TSRL) with less conservative policy constraints and a reliable latent-space data augmentation procedure. In extensive experiments, we find that TSRL achieves strong performance on small benchmark datasets with as few as 1% of the original samples, significantly outperforming recent offline RL algorithms in data efficiency and generalizability. Code is available at: https://github.com/pcheng2/TSRL
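
To make the idea of a T-symmetry-based reliability measure concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the forward latent dynamics estimate the latent time derivative, that the reverse dynamics estimate its negation, and that a sample's reliability can be scored by the norm of their sum. The names `t_symmetry_violation` and `filter_reliable`, the toy dynamics, and the threshold are all hypothetical and chosen only for illustration.

```python
import numpy as np

def t_symmetry_violation(f_forward, g_reverse, z, a, z_next, a_next):
    """Per-sample mismatch between forward and reverse latent dynamics.

    Assumption: f_forward(z, a) estimates dz/dt and g_reverse(z_next, a_next)
    estimates -dz/dt, so a consistent pair sums to (approximately) zero.
    """
    residual = f_forward(z, a) + g_reverse(z_next, a_next)
    return np.linalg.norm(residual, axis=-1)

def filter_reliable(violations, threshold):
    """Boolean mask of samples whose violation is within the threshold."""
    return violations <= threshold

# Toy dynamics (hypothetical): the action acts directly as the latent velocity.
f_forward = lambda z, a: a
g_reverse = lambda z, a: -a

dt = 0.1
z = np.zeros((3, 2))
a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
z_next = z + dt * a

# The third sample is deliberately made inconsistent with the other two.
a_next = a.copy()
a_next[2] = [-1.0, -1.0]

violations = t_symmetry_violation(f_forward, g_reverse, z, a, z_next, a_next)
mask = filter_reliable(violations, threshold=0.5)
```

Under these assumptions, samples with a small violation would be treated as reliable (for example, kept during latent-space data augmentation), while samples with a large violation would be treated as likely out-of-distribution.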