We propose the Thinker algorithm, a novel approach that enables reinforcement learning agents to autonomously interact with and utilize a learned world model. The Thinker algorithm wraps the environment with a world model and introduces new actions designed for interacting with that model. These model-interaction actions let an agent plan by proposing alternative plans to the world model before selecting a final action to execute in the environment. Because the agent learns how to plan on its own, the approach eliminates the need for hand-crafted planning algorithms, and the agent's plans can be easily interpreted through visualization. We demonstrate the algorithm's effectiveness through experiments on the game of Sokoban and the Atari 2600 benchmark, where the Thinker algorithm achieves state-of-the-art performance and competitive results, respectively. Visualizations of agents trained with the Thinker algorithm show that they have learned to plan effectively with the world model to select better actions. The algorithm's generality opens a new research direction on how a world model can be used in reinforcement learning and how planning can be seamlessly integrated into an agent's decision-making process.
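To make the idea of wrapping an environment with a world model more concrete, the following is a minimal illustrative sketch, not the Thinker implementation. Everything in it is an assumption made for illustration: the toy `LineEnv`, the hand-written `toy_model` standing in for a learned world model, the `ModelWrappedEnv` class, and the `"imagine"` / `"act"` action kinds are hypothetical names that do not come from the abstract. The sketch only shows the general shape: some actions query the model without touching the environment, and one action commits to a real step.

```python
from typing import Callable, Tuple


class LineEnv:
    """Hypothetical toy environment: walk along a line to reach a goal."""

    def __init__(self, goal: int = 3):
        self.goal = goal
        self.pos = 0

    def reset(self) -> int:
        self.pos = 0
        return self.pos

    def step(self, action: int) -> Tuple[int, float, bool]:
        # action is -1 (left) or +1 (right)
        self.pos += action
        done = self.pos == self.goal
        return self.pos, (1.0 if done else 0.0), done


def toy_model(state: int, action: int, goal: int = 3) -> Tuple[int, float]:
    """Stand-in for a learned world model (assumption: here it is exact).

    Predicts the next state and reward without touching the real environment.
    """
    nxt = state + action
    return nxt, (1.0 if nxt == goal else 0.0)


class ModelWrappedEnv:
    """Hypothetical wrapper exposing model-interaction actions.

    "imagine" steps only the world model; "act" steps the real environment.
    """

    def __init__(self, env: LineEnv, model: Callable[[int, int], Tuple[int, float]]):
        self.env = env
        self.model = model
        self.real_state = 0
        self.imagined_state = 0

    def reset(self) -> Tuple[int, int]:
        self.real_state = self.env.reset()
        self.imagined_state = self.real_state
        return self.real_state, self.imagined_state

    def step(self, kind: str, action: int):
        if kind == "imagine":
            # Roll the model forward from the current imagined state.
            # The returned reward is a model prediction, not a real reward.
            self.imagined_state, predicted = self.model(self.imagined_state, action)
            return (self.real_state, self.imagined_state), predicted, False
        if kind == "act":
            # Commit to a real action; planning restarts from the new real state.
            self.real_state, reward, done = self.env.step(action)
            self.imagined_state = self.real_state
            return (self.real_state, self.imagined_state), reward, done
        raise ValueError(f"unknown action kind: {kind!r}")


if __name__ == "__main__":
    env = ModelWrappedEnv(LineEnv(goal=3), toy_model)
    print(env.reset())
    print(env.step("imagine", +1))  # try a plan in the model only
    print(env.step("act", +1))      # then execute a real action
```

In this sketch the agent's policy would choose between `"imagine"` and `"act"` at every step, so how much to plan, and which plans to try, is left to learning rather than to a fixed search procedure. How the actual algorithm structures its actions, observations, and rewards is not specified in the abstract and is not reproduced here.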