Discovering achievements with a hierarchical structure in procedurally generated environments presents a significant challenge. This requires an agent to possess a broad range of abilities, including generalization and long-term reasoning. Many prior methods have been built upon model-based or hierarchical approaches, with the belief that an explicit module for long-term planning would be advantageous for learning hierarchical dependencies. However, these methods demand an excessive number of environment interactions or large model sizes, limiting their practicality. In this work, we demonstrate that proximal policy optimization (PPO), a simple yet versatile model-free algorithm, outperforms previous methods when optimized with recent implementation practices. Moreover, we find that the PPO agent can predict the next achievement to be unlocked to some extent, albeit with limited confidence. Based on this observation, we introduce a novel contrastive learning method, called achievement distillation, which strengthens the agent's ability to predict the next achievement. Our method exhibits a strong capacity for discovering hierarchical achievements and shows state-of-the-art performance on the challenging Crafter environment in a sample-efficient manner while utilizing fewer model parameters.
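For readers unfamiliar with PPO, the following minimal sketch shows the standard clipped surrogate objective that the algorithm optimizes. It is a generic illustration and not the authors' implementation: the function name `ppo_clip_loss`, its argument names, and the default `clip_eps=0.2` are assumptions made here, and the sketch omits the value loss, the entropy bonus, and the achievement distillation objective introduced in this work.

```python
import math


def ppo_clip_loss(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate loss (a quantity to minimize).

    new_logps / old_logps: log-probabilities of the taken actions under the
    current policy and the data-collecting policy (hypothetical inputs).
    advantages: advantage estimates for the same samples.
    clip_eps: clipping range; 0.2 is a common choice, assumed here.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Probability ratio between the current and the old policy.
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Pessimistic (lower) bound of the unclipped and clipped objectives.
        total += min(ratio * adv, clipped * adv)
    # Negate because the surrogate is maximized while a loss is minimized.
    return -total / len(advantages)
```

When the ratio moves outside the clipping range in the direction favored by the advantage, the clipped term is selected and the sample stops contributing gradient, which keeps each policy update close to the data-collecting policy.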