Discovering achievements with a hierarchical structure in procedurally generated environments presents a significant challenge. This requires an agent to possess a broad range of abilities, including generalization and long-term reasoning. Many prior methods have been built upon model-based or hierarchical approaches, with the belief that an explicit module for long-term planning would be advantageous for learning hierarchical dependencies. However, these methods demand an excessive number of environment interactions or large model sizes, limiting their practicality. In this work, we demonstrate that proximal policy optimization (PPO), a simple yet versatile model-free algorithm, outperforms previous methods when optimized with recent implementation practices. Moreover, we find that the PPO agent can predict the next achievement to be unlocked to some extent, albeit with limited confidence. Based on this observation, we introduce a novel contrastive learning method, called achievement distillation, which strengthens the agent's ability to predict the next achievement. Our method exhibits a strong capacity for discovering hierarchical achievements and shows state-of-the-art performance on the challenging Crafter environment in a sample-efficient manner while utilizing fewer model parameters.
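To make the contrastive idea concrete, the following is a minimal sketch of a generic InfoNCE-style loss, the standard objective behind many contrastive learning methods. It is an illustration under stated assumptions, not the achievement distillation loss itself, which the abstract does not specify. The function name `info_nce_loss`, the `temperature` value, and the use of cosine similarity are all hypothetical choices made for this example. The sketch uses plain Python lists so that it runs without third-party packages.

```python
import math


def _normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce_loss(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss (illustrative; not the paper's exact objective).

    anchors[i] and positives[i] are assumed to form a matching pair; every
    other row of `positives` acts as a negative for anchors[i].
    """
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    total = 0.0
    for i, a_i in enumerate(a):
        # Temperature-scaled cosine similarity of anchor i to every candidate.
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        # Numerically stable log-sum-exp over the candidates.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching pair (index i) as the target class.
        total += log_z - logits[i]
    return total / len(a)


# Toy embeddings: hypothetical 2-D vectors standing in for learned representations.
anchors = [[1.0, 0.0], [0.0, 1.0]]
matched = [[1.0, 0.0], [0.0, 1.0]]   # each anchor paired with its own positive
swapped = [[0.0, 1.0], [1.0, 0.0]]   # pairs deliberately mismatched

print(info_nce_loss(anchors, matched))
print(info_nce_loss(anchors, swapped))
```

The loss is low when each anchor is most similar to its own positive and high when the pairs are mismatched, which is the property a contrastive objective relies on to pull matching representations together.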