Discovering achievements with a hierarchical structure in procedurally generated environments poses a significant challenge. Doing so requires agents to possess a broad range of abilities, including generalization and long-term reasoning. Many prior methods build on model-based or hierarchical approaches, on the belief that an explicit module for long-term planning would benefit the learning of hierarchical achievements. However, these methods require an excessive number of environment interactions or large model sizes, limiting their practicality. In this work, we identify that proximal policy optimization (PPO), a simple and versatile model-free algorithm, outperforms the prior methods when combined with recent implementation practices. Moreover, we find that the PPO agent can predict the next achievement to be unlocked to some extent, though with low confidence. Based on this observation, we propose a novel contrastive learning method, called achievement distillation, that strengthens the agent's capability to predict the next achievement. Our method exhibits a strong capacity for discovering hierarchical achievements and shows state-of-the-art performance on the challenging Crafter environment, using fewer model parameters in a sample-efficient regime.
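For readers unfamiliar with contrastive learning, the following is a minimal sketch of a generic InfoNCE-style loss, written in plain Python. It is not the achievement distillation objective itself, whose exact form the abstract does not specify. The function name `info_nce_loss`, the `temperature` value, and the toy vectors are assumptions made purely for illustration. The sketch pairs each anchor embedding with the positive at the same index and treats every other positive in the batch as a negative.

```python
import math


def info_nce_loss(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch (illustrative sketch, not the paper's loss).

    Row i of `positives` is the positive example for row i of `anchors`;
    all other rows of `positives` serve as negatives for anchor i.
    """

    def normalize(vec):
        # Scale to unit length so dot products become cosine similarities.
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]

    total = 0.0
    for i, a_i in enumerate(a):
        # Similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching index as the correct class.
        total += log_sum_exp - logits[i]
    return total / len(a)


# Toy embeddings (hypothetical): matched pairs vs. deliberately mismatched pairs.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is small when each anchor is closest to its own positive and large when the pairs are mismatched, which is the behavior a contrastive objective relies on to pull related representations together.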