Offline reinforcement learning (RL) offers an appealing approach to real-world tasks by learning policies from pre-collected datasets without interacting with the environment. However, the performance of existing offline RL algorithms depends heavily on the scale and state-action space coverage of the datasets. Real-world data collection is often expensive and uncontrollable, which leads to small datasets with narrow coverage and poses significant challenges for practical deployments of offline RL. In this paper, we provide a new insight: leveraging the fundamental symmetry of system dynamics can substantially enhance offline RL performance on small datasets. Specifically, we propose a Time-reversal symmetry (T-symmetry) enforced Dynamics Model (TDM), which establishes consistency between a pair of forward and reverse latent dynamics. TDM provides both well-behaved representations for small datasets and a new reliability measure for out-of-distribution (OOD) samples based on their compliance with the T-symmetry. These can be readily used to construct a new offline RL algorithm (TSRL) with less conservative policy constraints and a reliable latent space data augmentation procedure. In extensive experiments, TSRL performs strongly on small benchmark datasets with as few as 1% of the original samples and significantly outperforms recent offline RL algorithms in terms of data efficiency and generalizability.
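To make the idea of consistency between forward and reverse dynamics concrete, the following is a minimal sketch and not the TDM implementation. It assumes a forward model that predicts the rate of change ds/dt from (s, a) and a reverse model that predicts -ds/dt from (s', a), so that a T-symmetric pair satisfies forward + reverse = 0. It also works on raw states rather than the latent representations that TDM learns. All names (`t_symmetry_residual`, `forward_model`, `consistent_reverse`, `biased_reverse`) and the toy system ds/dt = a are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Model = Callable[[Vector, Vector], List[float]]


def t_symmetry_residual(
    forward_model: Model,
    reverse_model: Model,
    state: Vector,
    action: Vector,
    next_state: Vector,
) -> float:
    """Squared mismatch between the forward and the reverse prediction.

    Assumption: forward_model(s, a) estimates ds/dt and
    reverse_model(s', a) estimates -ds/dt, so the sum of the two
    vanishes when the pair respects time-reversal symmetry.
    """
    forward_rate = forward_model(state, action)
    reverse_rate = reverse_model(next_state, action)
    return sum((f + r) ** 2 for f, r in zip(forward_rate, reverse_rate))


# Hypothetical toy system: ds/dt = a (a single integrator).
def forward_model(state: Vector, action: Vector) -> List[float]:
    return [float(a) for a in action]


def consistent_reverse(next_state: Vector, action: Vector) -> List[float]:
    # Exactly the negated forward rate: T-symmetry holds.
    return [-float(a) for a in action]


def biased_reverse(next_state: Vector, action: Vector) -> List[float]:
    # A constant bias of 0.5 per dimension breaks the symmetry.
    return [-float(a) + 0.5 for a in action]


state, action = [0.0, 1.0], [1.0, -2.0]
dt = 0.1
next_state = [s + dt * a for s, a in zip(state, action)]  # one Euler step

consistent_residual = t_symmetry_residual(
    forward_model, consistent_reverse, state, action, next_state
)
biased_residual = t_symmetry_residual(
    forward_model, biased_reverse, state, action, next_state
)
print(consistent_residual, biased_residual)
```

In this toy setting the residual is zero for the consistent pair and positive for the biased one. A residual of this kind could serve either as a training penalty or as a score for how far a sample departs from T-symmetry, which is the role the abstract assigns to the reliability measure for OOD samples.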