Many environments contain numerous niches of varying value, each associated with a different local optimum in the space of behaviors (policy space). In such settings it is often difficult to design a learning process that can avoid being distracted by poor local optima for long enough to discover the best available niche. In this work we propose a generic reinforcement learning (RL) algorithm that outperforms baseline deep Q-learning algorithms in environments with multiple variably valued niches. The algorithm consists of two parts: an agent architecture and a learning rule. The agent architecture contains multiple sub-policies. The learning rule is inspired by fitness sharing in evolutionary computation and is applied to reinforcement learning by using Value-Decomposition Networks in a novel manner, for a single agent's internal population of sub-policies. Concretely, it can be understood as adding an extra loss term through which one sub-policy's experience is also used to update all the other sub-policies, decreasing their value estimates for the visited states. In particular, when one sub-policy visits a state frequently, the value that the other sub-policies predict for going to that state decreases. Further, we introduce an artificial-chemistry-inspired platform in which it is easy to create tasks with multiple rewarding strategies that use different resources (i.e., multiple niches). We show that agents trained this way can escape poor-but-attractive local optima and instead converge to harder-to-discover, higher-value strategies, both in the artificial chemistry environments and in simpler illustrative environments.
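To make the learning rule described above more concrete, the following is a minimal tabular sketch, not the method itself. It assumes a table `q` of shape `[K, S, A]` holding the action values of `K` sub-policies, and it replaces the Value-Decomposition-Network loss with a much simpler stand-in: the acting sub-policy receives an ordinary Q-learning update, while every other sub-policy has its estimate for the visited state-action pair lowered by a fixed amount. The function name `exclusion_update` and the parameters `alpha`, `gamma`, and `beta` are hypothetical and chosen only for illustration.

```python
import numpy as np


def exclusion_update(q, k, s, a, r, s_next, done,
                     alpha=0.1, gamma=0.99, beta=0.05):
    """One illustrative update on a [K, S, A] table of sub-policy Q-values.

    ASSUMPTION: this is a simplified stand-in for the extra loss term,
    not a Value-Decomposition-Network loss.
    k is the index of the sub-policy that generated the transition.
    """
    # Standard Q-learning update for the sub-policy that acted.
    bootstrap = 0.0 if done else gamma * q[k, s_next].max()
    q[k, s, a] += alpha * (r + bootstrap - q[k, s, a])

    # Exclusion step: the same experience lowers the estimates of all
    # other sub-policies for the visited state-action pair.
    for j in range(q.shape[0]):
        if j != k:
            q[j, s, a] -= beta
    return q


# Usage: three sub-policies, two states, two actions, all values start at 0.
q = np.zeros((3, 2, 2))
q = exclusion_update(q, k=0, s=0, a=1, r=1.0, s_next=1, done=True)
```

In this sketch, repeated visits by sub-policy 0 keep pushing the other sub-policies' estimates for that state-action pair down, which nudges them toward other parts of the state space. The fixed decrement `beta` is a design simplification; a learned loss term would shape the size of this effect differently.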