In many real-world scenarios, Reinforcement Learning (RL) algorithms are trained on data with dynamics shift, i.e., data collected under different underlying environment dynamics. Most current methods address this issue by training context encoders to identify environment parameters. Data with dynamics shift are then separated according to their environment parameters and used to train the corresponding policies. However, these methods can be sample inefficient because data are used \textit{ad hoc}: a policy trained for one dynamics cannot benefit from data collected in other environments with different dynamics. In this paper, we find that in many environments with similar structures and different dynamics, optimal policies have similar stationary state distributions. We exploit this property and learn the stationary state distribution from data with dynamics shift for efficient data reuse. This distribution is used to regularize the policy trained in a new environment, leading to the SRPO (\textbf{S}tate \textbf{R}egularized \textbf{P}olicy \textbf{O}ptimization) algorithm. To support theoretical analysis, we characterize the intuition of similar environment structures with the notion of homomorphous MDPs. We then establish a lower-bound performance guarantee for policies regularized by the stationary state distribution. In practice, SRPO can serve as an add-on module to context-based algorithms in both online and offline RL settings. Experimental results show that SRPO makes several context-based algorithms far more data efficient and significantly improves their overall performance.
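To make the idea of regularizing a policy by a stationary state distribution concrete, the following is a minimal tabular sketch, not the paper's implementation. It assumes a small finite MDP, a discounted stationary state distribution, and a KL penalty with weight `lam`; the abstract does not specify the divergence or how the reference distribution is learned, so the function names, the penalty form, and the toy MDP are all illustrative assumptions.

```python
import numpy as np


def stationary_state_distribution(P, pi, mu0, gamma=0.9):
    """Discounted stationary state distribution of policy pi in a tabular MDP.

    P:   (S, A, S) transition probabilities P[s, a, s']
    pi:  (S, A) action probabilities pi[s, a]
    mu0: (S,) initial state distribution
    """
    # State-to-state transitions under pi: P_pi[s, s'] = sum_a pi[s, a] * P[s, a, s']
    P_pi = np.einsum("sa,sat->st", pi, P)
    n_states = P.shape[0]
    # d^T = (1 - gamma) * mu0^T (I - gamma * P_pi)^{-1}, solved as a linear system
    return (1.0 - gamma) * np.linalg.solve((np.eye(n_states) - gamma * P_pi).T, mu0)


def expected_return(d, pi, r, gamma=0.9):
    """Discounted return written in terms of the state distribution d."""
    return float(np.sum(d[:, None] * pi * r) / (1.0 - gamma))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, clipped for numerical safety."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))


def regularized_objective(ret, d_pi, d_ref, lam):
    """Return minus a penalty for deviating from the reference distribution.

    The KL penalty is an assumption made for this sketch.
    """
    return ret - lam * kl_divergence(d_pi, d_ref)


# Hypothetical 2-state, 2-action MDP: action 0 mostly stays, action 1 mostly switches.
P = np.array([
    [[0.9, 0.1], [0.1, 0.9]],  # transitions from state 0
    [[0.1, 0.9], [0.9, 0.1]],  # transitions from state 1
])
r = np.array([[0.0, 0.0], [1.0, 1.0]])  # reward only in state 1
mu0 = np.array([0.5, 0.5])

pi_ref = np.array([[0.0, 1.0], [1.0, 0.0]])    # heads toward state 1
pi_other = np.array([[1.0, 0.0], [0.0, 1.0]])  # heads toward state 0

# d_ref stands in for a distribution learned from data with dynamics shift.
d_ref = stationary_state_distribution(P, pi_ref, mu0)
d_other = stationary_state_distribution(P, pi_other, mu0)

ret_other = expected_return(d_other, pi_other, r)
obj_other = regularized_objective(ret_other, d_other, d_ref, lam=1.0)
```

In this toy setting, a policy whose state distribution matches `d_ref` incurs no penalty, while `pi_other` is penalized in proportion to how far its state visitation drifts from the reference.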