TD-MPC is a model-based reinforcement learning (RL) algorithm that performs local trajectory optimization in the latent space of a learned implicit (decoder-free) world model. In this work, we present TD-MPC2: a series of improvements upon the TD-MPC algorithm. We demonstrate that TD-MPC2 improves significantly over baselines across 104 online RL tasks spanning 4 diverse task domains, achieving consistently strong results with a single set of hyperparameters. We further show that agent capabilities increase with model and data size, and successfully train a single 317M-parameter agent to perform 80 tasks across multiple task domains, embodiments, and action spaces. We conclude with an account of lessons, opportunities, and risks associated with large TD-MPC2 agents. Explore videos, models, data, code, and more at https://tdmpc2.com