Planning is a data-efficient decision-making strategy in which an agent selects candidate actions by exploring possible future states. When the action space is high-dimensional, simulating future states requires knowledge of the agent's own decision-making strategy to limit the number of actions explored. We refer to the model used to simulate one's own decisions as the agent's self-model. Although self-models are widely used implicitly alongside world models to plan actions, it remains unclear how they should be designed. Inspired by current reinforcement learning approaches and by neuroscience, we explore the benefits and limitations of using a distilled policy network as the self-model. In such dual-policy agents, a model-free policy and a distilled policy are used for model-free actions and planned actions, respectively. Our results on an ecologically relevant, parametric environment indicate that using a distilled policy network as the self-model stabilizes training, offers faster inference than using the model-free policy, promotes better exploration, and could learn a comprehensive understanding of the agent's own behaviors, at the cost of distilling a separate network in addition to the model-free policy.
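The dual-policy idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a small tanh network as a stand-in model-free policy, a linear softmax self-model distilled from it by cross-entropy, and a toy additive world model. All names (`distill`, `plan`, `action_effects`, `reward_fn`) and all sizes are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kl(p, q, eps=1e-12):
    # Average KL(p || q) over a batch of action distributions.
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def distill(states, teacher_probs, lr=0.5, steps=300):
    # Fit a linear softmax self-model to the teacher's action distributions
    # by gradient descent on the cross-entropy (KL up to a constant).
    W = np.zeros((states.shape[1], teacher_probs.shape[1]))
    for _ in range(steps):
        student_probs = softmax(states @ W)
        W -= lr * states.T @ (student_probs - teacher_probs) / len(states)
    return W

def plan(state, W, world_model, reward_fn, k=3):
    # Use the self-model to keep only the k most likely actions,
    # simulate each with the world model, and pick the best one-step outcome.
    probs = softmax(state @ W)
    candidates = np.argsort(probs)[-k:]
    rewards = [reward_fn(world_model(state, a)) for a in candidates]
    return int(candidates[int(np.argmax(rewards))])

rng = np.random.default_rng(0)
n_states, state_dim, n_actions = 256, 4, 8  # assumed toy sizes
states = rng.normal(size=(n_states, state_dim))

# Hypothetical model-free policy: a fixed random two-layer network.
W1 = rng.normal(size=(state_dim, 16))
W2 = rng.normal(size=(16, n_actions))
teacher_probs = softmax(np.tanh(states @ W1) @ W2)

# Distill the self-model and compare against a uniform (untrained) student.
W = distill(states, teacher_probs)
kl_before = mean_kl(teacher_probs, np.full_like(teacher_probs, 1.0 / n_actions))
kl_after = mean_kl(teacher_probs, softmax(states @ W))

# Assumed toy world model: each action shifts the state by a fixed vector.
action_effects = rng.normal(size=(n_actions, state_dim))
world_model = lambda s, a: s + 0.1 * action_effects[a]
reward_fn = lambda s: -float(np.linalg.norm(s))  # assumed reward: stay near the origin

action = plan(states[0], W, world_model, reward_fn, k=3)
print(kl_before, kl_after, action)
```

In this sketch the planner evaluates only `k` of the `n_actions` actions, which is the sense in which a self-model limits the actions to be explored. The distilled model is also cheaper to query than the stand-in model-free network, though the sketch makes no claim about the training stability or exploration results reported above.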