Offline reinforcement learning (RL) offers an appealing approach to real-world tasks by learning policies from pre-collected datasets without interacting with the environment. However, the performance of existing offline RL algorithms depends heavily on the scale and state-action space coverage of the datasets. Real-world data collection is often expensive and uncontrollable, which leads to small datasets with narrow coverage and poses significant challenges for practical deployment of offline RL. In this paper, we provide a new insight: leveraging the fundamental symmetry of system dynamics can substantially enhance offline RL performance on small datasets. Specifically, we propose a Time-reversal symmetry (T-symmetry) enforced Dynamics Model (TDM), which establishes consistency between a pair of forward and reverse latent dynamics. TDM provides both well-behaved representations for small datasets and a new reliability measure for out-of-distribution (OOD) samples based on their compliance with the T-symmetry. These can be readily used to construct a new offline RL algorithm (TSRL) with less conservative policy constraints and a reliable latent space data augmentation procedure. In extensive experiments, TSRL achieves strong performance on small benchmark datasets with as few as 1% of the original samples, significantly outperforming recent offline RL algorithms in terms of data efficiency and generalizability. Code is available at: https://github.com/pcheng2/TSRL
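To make the idea of consistency between forward and reverse latent dynamics concrete, the following is a minimal sketch and not the TDM implementation from the repository. It assumes a one-dimensional toy latent system with dz/dt = -z + a, a hand-written forward model that predicts dz/dt from the current latent state, and a matching reverse model that predicts -dz/dt from the next latent state. The residual form (squared sum of the two predictions), the step size `DT`, and all function names are illustrative assumptions.

```python
# Toy illustration of a T-symmetry consistency residual (assumed form, not the paper's code).
DT = 0.1  # assumed integration step of the toy system


def forward_model(z, a):
    """Predict dz/dt from the current latent state z and action a."""
    return -z + a


def reverse_model(z_next, a):
    """Predict -dz/dt from the next latent state z_next and action a.

    Derived by inverting z_next = z + DT * (-z + a) for the toy system.
    """
    return (z_next - a) / (1.0 - DT)


def t_symmetry_residual(z, a, z_next):
    """Squared mismatch between forward and reverse predictions.

    For a transition that obeys the dynamics, the forward prediction and the
    reverse prediction cancel, so the residual is close to zero.
    """
    return (forward_model(z, a) + reverse_model(z_next, a)) ** 2


z, a = 0.8, -0.3
z_next = z + DT * forward_model(z, a)  # transition that follows the toy dynamics

consistent = t_symmetry_residual(z, a, z_next)        # close to zero
corrupted = t_symmetry_residual(z, a, z_next + 0.5)   # clearly positive
print(consistent, corrupted)
```

In this sketch, a transition that follows the dynamics yields a residual near zero, while a perturbed next state yields a larger one. This is the sense in which compliance with T-symmetry could serve as a reliability score for samples; in a learned model both functions would be trained networks rather than closed-form expressions.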