Recently, action-based decision-making in open-world environments has gained significant attention. Visual Language Action (VLA) models, pretrained on large-scale web datasets, have shown promise in decision-making tasks. However, previous work has primarily focused on action post-training, often neglecting enhancements to the foundation model itself. In response, we introduce a novel approach, Act from Visual Language Post-Training, which refines Visual Language Models (VLMs) through visual and linguistic guidance in a self-supervised manner. This enhancement improves the models' world knowledge, visual recognition, and spatial grounding in open-world environments. Following this post-training paradigm, we obtain the first VLA models in Minecraft that can follow human instructions across more than 1,000 distinct atomic tasks, including crafting, smelting, cooking, mining, and killing. Our experiments show that post-training on non-trajectory tasks yields a 40% improvement over the best agent baseline on a diverse set of atomic tasks. Furthermore, our approach surpasses traditional imitation-learning-based policies in Minecraft, achieving state-of-the-art performance. We have open-sourced the code, models, and datasets to foster further research. The project page can be found at https://craftjarvis.github.io/JarvisVLA.