This paper presents OmniJARVIS, a novel Vision-Language-Action (VLA) model for open-world instruction-following agents in Minecraft. Unlike prior works that either emit textual goals to separate controllers or produce control commands directly, OmniJARVIS pursues a different path to ensure both strong reasoning and efficient decision-making capabilities via unified tokenization of multimodal interaction data. First, we introduce a self-supervised approach to learn a behavior encoder that produces discretized tokens for behavior trajectories $\tau = \{o_0, a_0, \dots\}$ and an imitation-learning policy decoder conditioned on these tokens. These additional behavior tokens are added to the vocabulary of pretrained Multimodal Language Models. With this encoder, we then pack long-term multimodal interactions involving task instructions, memories, thoughts, observations, textual responses, behavior trajectories, etc. into unified token sequences and model them with autoregressive transformers. Thanks to the semantically meaningful behavior tokens, the resulting VLA model, OmniJARVIS, can reason (by producing chains of thought), plan, answer questions, and act (by producing behavior tokens for the imitation-learning policy decoder). OmniJARVIS demonstrates excellent performance on a comprehensive collection of atomic, programmatic, and open-ended tasks in open-world Minecraft. Our analysis further unveils the crucial design principles in interaction data formation, unified tokenization, and its scaling potential. The dataset, models, and code will be released at https://craftjarvis.org/OmniJARVIS.
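The unified tokenization described above can be illustrated with a minimal, hypothetical sketch: a VQ-style behavior encoder maps a trajectory-chunk embedding to its nearest codebook index, the index is offset past the text vocabulary (the "augmented vocabulary"), and text and behavior tokens are packed into one autoregressive sequence. All sizes, names, and token ids below are illustrative assumptions, not the paper's actual configuration.

```python
import math
import random

# Assumed, illustrative sizes -- not the paper's real configuration.
TEXT_VOCAB_SIZE = 32000   # vocabulary size of the pretrained LM
NUM_BEHAVIOR_CODES = 512  # size of the behavior-token codebook
EMB_DIM = 8               # toy embedding dimension for a trajectory chunk

# In the paper the codebook is learned self-supervised; here it is random.
random.seed(0)
codebook = [[random.gauss(0.0, 1.0) for _ in range(EMB_DIM)]
            for _ in range(NUM_BEHAVIOR_CODES)]

def encode_behavior(chunk_embedding):
    """Quantize one trajectory-chunk embedding to its nearest codebook index."""
    return min(range(NUM_BEHAVIOR_CODES),
               key=lambda i: math.dist(codebook[i], chunk_embedding))

def to_unified_token(behavior_code):
    """Shift a behavior code past the text vocabulary (augmented vocab)."""
    return TEXT_VOCAB_SIZE + behavior_code

# Pack instruction tokens, a chain of thought, and a behavior token into one
# sequence for an autoregressive transformer (text ids are placeholders).
instruction_ids = [101, 2054, 2003]          # e.g. "chop a tree"
thought_ids = [1045, 2097, 2424]             # e.g. "I should find wood first"
chunk = [random.gauss(0.0, 1.0) for _ in range(EMB_DIM)]
behavior_ids = [to_unified_token(encode_behavior(chunk))]

sequence = instruction_ids + thought_ids + behavior_ids
print(sequence)
```

At decoding time, any emitted token with id `>= TEXT_VOCAB_SIZE` would be routed to the policy decoder as a behavior token, while lower ids are detokenized as text; this routing rule is the sketch's interpretation of the shared token stream.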