Both entropy-minimizing and entropy-maximizing (curiosity) objectives for unsupervised reinforcement learning (RL) have been shown to be effective in different environments, depending on the environment's level of natural entropy. However, neither method alone yields an agent that consistently learns intelligent behavior across environments. To find a single entropy-based method that encourages emergent behaviors in any environment, we propose an agent that adapts its objective online to the entropy conditions by framing the choice between the two objectives as a multi-armed bandit problem. We devise a novel intrinsic feedback signal for the bandit that captures the agent's ability to control the entropy in its environment. We demonstrate that such agents can learn to control entropy and exhibit emergent behaviors in both high- and low-entropy regimes, and can learn skillful behaviors in benchmark tasks. Videos of the trained agents and summarized findings can be found on our project page: https://sites.google.com/view/surprise-adaptive-agents
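The bandit framing described in the abstract can be illustrated with a minimal sketch. It assumes a standard UCB1 bandit with two arms, one per objective, queried once per episode. The abstract does not specify the bandit algorithm or the exact form of the feedback signal, so the class `UCB1ObjectiveSelector`, the function `entropy_control_feedback`, and the reference-entropy baseline are all hypothetical names and choices made for illustration only.

```python
import math


class UCB1ObjectiveSelector:
    """Two-armed UCB1 bandit that picks an intrinsic objective per episode.

    Assumption: UCB1 is used here only as a familiar bandit algorithm;
    the abstract does not say which bandit algorithm the authors use.
    """

    ARMS = ("minimize_entropy", "maximize_entropy")

    def __init__(self, exploration=2.0):
        self.exploration = exploration
        self.counts = [0] * len(self.ARMS)   # number of pulls per arm
        self.means = [0.0] * len(self.ARMS)  # running mean feedback per arm

    def select(self):
        """Return the index of the arm to play next."""
        # Play every arm once before using the confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(self.exploration * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, feedback):
        """Fold one feedback value into the running mean for `arm`."""
        self.counts[arm] += 1
        self.means[arm] += (feedback - self.means[arm]) / self.counts[arm]


def entropy_control_feedback(episode_entropy, reference_entropy):
    """Hypothetical feedback: relative entropy change versus a reference.

    Assumption: "ability to control entropy" is approximated as how far the
    agent moved the entropy away from a reference value (for example, the
    entropy observed under some baseline behavior). This is a placeholder,
    not the signal defined in the paper.
    """
    return abs(episode_entropy - reference_entropy) / max(abs(reference_entropy), 1e-8)
```

In a training loop under these assumptions, the agent would call `select()` at the start of each episode, optimize the chosen objective for that episode, and then call `update()` with the feedback computed from the entropy it observed. Because the placeholder feedback is symmetric in the direction of change, it rewards whichever objective lets the agent shift entropy the most in the current environment.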