Learning predictive world models is crucial for enhancing the planning capabilities of reinforcement learning (RL) agents. Recently, MuZero-style algorithms, leveraging the value equivalence principle and Monte Carlo Tree Search (MCTS), have achieved superhuman performance in various domains. However, these methods struggle to scale in heterogeneous scenarios with diverse dependencies and task variability. To overcome these limitations, we introduce UniZero, a novel approach that employs a modular transformer-based world model to effectively learn a shared latent space. By concurrently predicting latent dynamics and decision-oriented quantities conditioned on the learned latent history, UniZero enables joint optimization of the long-horizon world model and policy, facilitating broader and more efficient planning in the latent space. We show that UniZero significantly outperforms existing baselines in benchmarks that require long-term memory. Additionally, UniZero demonstrates superior scalability in multitask learning experiments conducted on Atari benchmarks. In standard single-task RL settings, such as Atari and DMControl, UniZero matches or even surpasses the performance of current state-of-the-art methods. Finally, extensive ablation studies and visual analyses validate the effectiveness and scalability of UniZero's design choices. Our code is available at \textcolor{magenta}{https://github.com/opendilab/LightZero}.