Recent embodied agents are primarily built on reinforcement learning (RL) or large language models (LLMs). RL agents are efficient to deploy but handle only a narrow set of tasks. By contrast, giant LLM agents (often exceeding 1000B parameters) exhibit strong generalization but demand enormous computing resources. In this work, we combine their advantages while avoiding their drawbacks by applying the proposed referee RL to our large auto-regressive model (LARM). Specifically, LARM is built upon a lightweight LLM (fewer than 5B parameters) and directly outputs the next action to execute rather than text. We mathematically show that classic RL reward signals vanish in long-horizon embodied exploration, and we introduce a giant-LLM-based referee to counteract this reward vanishing while training LARM. In this way, LARM learns to complete diverse open-world tasks without human intervention. Notably, LARM successfully harvests enchanted diamond equipment in Minecraft, a task that demands significantly longer decision-making chains than the highest achievements of prior methods.
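To make the referee-RL idea concrete, below is a minimal sketch of how a giant LLM referee could densify an otherwise sparse, long-horizon reward during rollout collection. It assumes a gym-like environment interface and uses placeholder names (`referee_score`, `env.step`, `policy.act`, `LAMBDA`) that are not the paper's actual API; the real method's referee prompt, scoring scheme, and training algorithm are described in the body of the paper.

```python
import random

# Hypothetical sketch of "referee RL": a large LLM scores each transition and
# supplies a dense auxiliary reward, compensating for sparse environment rewards
# whose discounted contribution vanishes over long horizons.

GAMMA = 0.99      # discount factor
LAMBDA = 0.1      # weight of the referee's auxiliary reward (assumed hyperparameter)

def referee_score(observation, action):
    """Stand-in for querying a giant LLM referee about task progress (0..1)."""
    return random.random()  # replace with an actual LLM call in practice

def rollout(env, policy, horizon=1000):
    """Collect one episode, mixing the sparse env reward with the dense referee reward."""
    obs = env.reset()
    trajectory = []
    for _ in range(horizon):
        action = policy.act(obs)
        next_obs, sparse_reward, done = env.step(action)
        shaped_reward = sparse_reward + LAMBDA * referee_score(obs, action)
        trajectory.append((obs, action, shaped_reward))
        obs = next_obs
        if done:
            break
    return trajectory
```

The shaped trajectories would then feed a standard policy-gradient or actor-critic update; the sketch only illustrates where the referee signal enters, not the paper's exact optimization procedure.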