A Markov Decision Process (MDP) provides a mathematical framework for formulating the learning processes of agents in reinforcement learning. An MDP is limited by the Markovian assumption that a reward depends only on the current state and action. However, a reward sometimes depends on the history of states and actions, in which case the decision process takes place in a non-Markovian environment. In such environments, agents receive rewards sparsely, through temporally extended behaviors, and the learned policies may be similar. Agents with similar policies generally overfit to the given task and cannot quickly adapt to perturbations of the environment. To resolve this problem, this paper learns diverse policies from the history of state-action pairs in a non-Markovian environment, using a policy dispersion scheme designed to seek diverse policy representations. Specifically, we first adopt a transformer-based method to learn policy embeddings. Then, we stack the policy embeddings to construct a dispersion matrix that induces a set of diverse policies. Finally, we prove that if the dispersion matrix is positive definite, the dispersed embeddings can effectively enlarge the disagreements across policies, yielding a diverse expression of the original policy embedding distribution. Experimental results show that this dispersion scheme obtains more expressive diverse policies, which in turn achieve more robust performance than recent learning baselines across various learning environments.
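The role of positive definiteness can be illustrated with a small numerical sketch. This is not the construction used in the paper, since the abstract does not specify how the dispersion matrix is built. The sketch assumes that random vectors stand in for the transformer-produced policy embeddings and that the dispersion matrix is the identity plus a scaled Gram matrix of the centered embeddings. Under these assumptions all eigenvalues are at least 1, which is stronger than positive definiteness alone and is what guarantees that pairwise distances do not shrink in this linear setting. The names `is_positive_definite`, `mean_pairwise_distance`, and `alpha` are hypothetical.

```python
import numpy as np


def is_positive_definite(matrix):
    """Return True if the matrix is symmetric and has a Cholesky factorization."""
    if not np.allclose(matrix, matrix.T):
        return False
    try:
        np.linalg.cholesky(matrix)
        return True
    except np.linalg.LinAlgError:
        return False


def mean_pairwise_distance(embeddings):
    """Average Euclidean distance over all ordered pairs of distinct rows."""
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    n = embeddings.shape[0]
    return dists.sum() / (n * (n - 1))


rng = np.random.default_rng(0)

# Assumption: 4 nearly identical "policy embeddings" in 8 dimensions,
# standing in for the output of a transformer encoder.
base = rng.normal(size=8)
embeddings = base + 0.01 * rng.normal(size=(4, 8))

# Assumption: build the dispersion matrix from the stacked, centered embeddings.
centered = embeddings - embeddings.mean(axis=0)
gram = centered.T @ centered                 # symmetric positive semidefinite
gram = gram / np.linalg.norm(gram, 2)        # normalize by the spectral norm
alpha = 5.0                                  # hypothetical dispersion strength
dispersion = np.eye(8) + alpha * gram        # eigenvalues lie in [1, 1 + alpha]

# Apply the dispersion matrix to every embedding (rows are embeddings).
dispersed = embeddings @ dispersion

print("positive definite:", is_positive_definite(dispersion))
print("mean distance before:", mean_pairwise_distance(embeddings))
print("mean distance after: ", mean_pairwise_distance(dispersed))
```

In this sketch the mean pairwise distance grows after dispersion because every difference between two embeddings lies in the row space of the centered matrix, where the dispersion matrix stretches vectors. A positive definite matrix with eigenvalues below 1 could instead shrink some distances in this simple linear map, so the eigenvalue bound here is an assumption of the sketch and not a claim about the paper's theorem.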