In many real-world scenarios, Reinforcement Learning (RL) algorithms are trained on data with dynamics shift, i.e., data collected under different environment dynamics. Most current methods address this issue by training context encoders to identify environment parameters; data with dynamics shift are then separated according to these parameters, and each subset is used to train its corresponding policy. However, these methods can be sample inefficient because data are used \textit{ad hoc}: a policy trained for one dynamics cannot benefit from data collected in other environments with different dynamics. In this paper, we find that in many environments with similar structures and different dynamics, optimal policies have similar stationary state distributions. We exploit this property and learn the stationary state distribution from data with dynamics shift for efficient data reuse. This distribution is used to regularize the policy trained in a new environment, leading to the SRPO (\textbf{S}tate \textbf{R}egularized \textbf{P}olicy \textbf{O}ptimization) algorithm. To support theoretical analysis, we formalize the intuition of similar environment structures through the notion of homomorphous MDPs. We then establish a lower-bound performance guarantee for policies regularized by the stationary state distribution. In practice, SRPO can serve as an add-on module to context-based algorithms in both online and offline RL settings. Experimental results show that SRPO can make several context-based algorithms far more data efficient and significantly improve their overall performance.
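To make the idea of regularizing a policy toward a stationary state distribution concrete, the following is a minimal sketch and not the SRPO implementation. It assumes a small discrete state space and a reward bonus of the form $\lambda \log(\zeta(s)/d_\pi(s))$, where $\zeta$ is the target state distribution and $d_\pi$ is the current policy's state distribution. The abstract does not state the form of the regularizer, so this choice is an assumption. The helper names `empirical_distribution` and `shaped_reward`, the coefficient `lam`, and the state samples are all hypothetical.

```python
import math
from collections import Counter


def empirical_distribution(states, support, smoothing=1e-3):
    """Smoothed empirical state distribution over a finite support."""
    counts = Counter(states)
    total = len(states) + smoothing * len(support)
    return {s: (counts[s] + smoothing) / total for s in support}


def shaped_reward(reward, state, target_dist, policy_dist, lam=0.1):
    """Reward plus a log density-ratio bonus (assumed form of the regularizer)."""
    return reward + lam * math.log(target_dist[state] / policy_dist[state])


support = [0, 1, 2]
# Hypothetical states visited by near-optimal policies under several dynamics.
target = empirical_distribution([2, 2, 2, 1, 2, 2], support)
# Hypothetical states visited by the current policy in the new environment.
current = empirical_distribution([0, 0, 1, 0, 2, 0], support)

# State 2 is common under the target distribution but rare for the current
# policy, so it earns a positive bonus; state 0 is the opposite case.
bonus_target = shaped_reward(0.0, 2, target, current)
bonus_rare = shaped_reward(0.0, 0, target, current)
```

In this toy setting the bonus pushes the policy toward states that the pooled data visits often and away from states it rarely visits. With continuous states the density ratio would have to be estimated by a learned model rather than by counting.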