Offline reinforcement learning (RL) offers an appealing approach to real-world tasks by learning policies from pre-collected datasets without interacting with the environment. However, the performance of existing offline RL algorithms depends heavily on the scale and state-action space coverage of the datasets. Real-world data collection is often expensive and uncontrollable, which leads to small, narrowly covered datasets and poses significant challenges for practical deployment of offline RL. In this paper, we provide a new insight: leveraging the fundamental symmetry of system dynamics can substantially enhance offline RL performance on small datasets. Specifically, we propose a Time-reversal symmetry (T-symmetry) enforced Dynamics Model (TDM), which establishes consistency between a pair of forward and reverse latent dynamics. TDM provides both well-behaved representations for small datasets and a new reliability measure for out-of-distribution (OOD) samples based on compliance with the T-symmetry. These can be readily used to construct a new offline RL algorithm (TSRL) with less conservative policy constraints and a reliable latent space data augmentation procedure. In extensive experiments, TSRL achieves strong performance on small benchmark datasets with as few as 1% of the original samples and significantly outperforms recent offline RL algorithms in data efficiency and generalizability. Code is available at: https://github.com/pcheng2/TSRL
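The consistency between forward and reverse latent dynamics can be illustrated with a small sketch. It is not the TDM implementation from the linked repository. It assumes, purely for illustration, that the forward model predicts the latent change from the current latent state and action, that the reverse model predicts the opposite change from the next latent state and the same action, and that consistency means the two predictions cancel. The names `t_symmetry_residual`, `toy_forward`, `toy_reverse`, and `biased_reverse`, and the linear toy dynamics, are hypothetical.

```python
import numpy as np


def t_symmetry_residual(forward_fn, reverse_fn, z, a, z_next):
    """Per-sample squared disagreement between forward and reverse latent dynamics.

    Assumption (not taken from the source): forward_fn(z, a) predicts the latent
    change z_next - z, and reverse_fn(z_next, a) predicts the reversed change
    z - z_next. If the pair is consistent, the two predictions sum to zero.
    """
    forward_delta = forward_fn(z, a)
    reverse_delta = reverse_fn(z_next, a)
    return np.sum((forward_delta + reverse_delta) ** 2, axis=-1)


# Hypothetical linear latent dynamics used only for this example: z_next = z + a @ B.
B = np.array([[0.5, 0.0], [0.0, -0.25]])


def toy_forward(z, a):
    # Forward model: predicted latent change for action a.
    return a @ B


def toy_reverse(z_next, a):
    # Reverse model that exactly undoes the forward change.
    return -(a @ B)


def biased_reverse(z_next, a):
    # Reverse model that disagrees with the forward model by a constant offset.
    return -(a @ B) + 0.1


rng = np.random.default_rng(0)
z = rng.normal(size=(4, 2))       # batch of 4 latent states, dimension 2
a = rng.normal(size=(4, 2))       # batch of 4 actions, dimension 2
z_next = z + toy_forward(z, a)    # next latent states under the toy dynamics

consistent = t_symmetry_residual(toy_forward, toy_reverse, z, a, z_next)
inconsistent = t_symmetry_residual(toy_forward, biased_reverse, z, a, z_next)
```

Under these assumptions, a residual near zero indicates that a sample complies with the symmetry, while a larger residual flags a sample on which the two models disagree. A score of this kind could serve as a reliability signal for filtering augmented or OOD samples, although the exact measure used by TSRL may differ.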