The Markov Decision Process (MDP) provides a mathematical framework for formulating the learning processes of agents in reinforcement learning. An MDP is limited by the Markovian assumption that a reward depends only on the immediate state and action. However, a reward sometimes depends on the history of states and actions, in which case the decision process takes place in a non-Markovian environment. In such environments, agents receive rewards sparsely, through temporally extended behaviors, and the learned policies may be similar. As a result, agents that acquire similar policies generally overfit to the given task and cannot quickly adapt to perturbations of the environment. To resolve this problem, this paper aims to learn diverse policies from the history of state-action pairs in a non-Markovian environment, using a policy dispersion scheme designed to seek diverse policy representations. Specifically, we first adopt a transformer-based method to learn policy embeddings. Then, we stack the policy embeddings to construct a dispersion matrix that induces a set of diverse policies. Finally, we prove that if the dispersion matrix is positive definite, the dispersed embeddings can effectively enlarge the disagreements across policies, yielding a diverse expression of the original policy embedding distribution. Experimental results show that this dispersion scheme obtains more expressive, diverse policies, which in turn achieve more robust performance than recent learning baselines in various learning environments.
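The stacking and positive-definiteness steps mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how the dispersion matrix is built, so the code below uses the Gram matrix of the stacked embeddings purely as a hypothetical stand-in, and uses mean pairwise Euclidean distance as an assumed proxy for "disagreement across policies". All function names are illustrative and are not taken from the paper. The one fact it relies on is standard linear algebra: a Gram matrix is positive definite exactly when the stacked vectors are linearly independent, so collapsed (identical) policy embeddings fail the test.

```python
import math
from itertools import combinations


def gram_matrix(rows):
    """Gram matrix of stacked embeddings (one row per policy).

    Assumption: a stand-in for the paper's dispersion matrix,
    whose actual construction is not given in the abstract.
    """
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


def is_positive_definite(matrix, tol=1e-12):
    """Cholesky test: a symmetric matrix is positive definite
    if and only if every Cholesky pivot is strictly positive."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= tol:  # zero or negative pivot: not positive definite
                    return False
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return True


def mean_pairwise_distance(rows):
    """Assumed diversity proxy: average Euclidean distance between embeddings."""
    pairs = list(combinations(rows, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical toy embeddings; real ones would come from the transformer encoder.
    diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
    collapsed = [[1.0, 0.0], [1.0, 0.0]]  # two identical policies

    print(is_positive_definite(gram_matrix(diverse)))    # True
    print(is_positive_definite(gram_matrix(collapsed)))  # False
    print(mean_pairwise_distance(diverse), mean_pairwise_distance(collapsed))
```

In floating point, nearly dependent embeddings can produce tiny pivots, so the `tol` threshold is a judgment call rather than a fixed rule.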