Building embodied AI systems that can follow arbitrary language instructions in any 3D environment is a key challenge for creating general AI. Achieving this goal requires learning to ground language in perception and embodied action in order to carry out complex tasks. The Scalable, Instructable, Multiworld Agent (SIMA) project tackles this by training agents to follow free-form instructions across a diverse range of virtual 3D environments, including curated research environments as well as open-ended, commercial video games. Our goal is to develop an instructable agent that can accomplish anything a human can do in any simulated 3D environment. Our approach focuses on language-driven generality while imposing minimal assumptions. Our agents interact with environments in real time using a generic, human-like interface: the inputs are image observations and language instructions, and the outputs are keyboard-and-mouse actions. This general approach is challenging, but it allows agents to ground language across many visually complex and semantically rich environments, while also allowing us to readily run agents in new environments. In this paper we describe our motivation and goal, the initial progress we have made, and promising preliminary results on several diverse research environments and a variety of commercial video games.
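The generic, human-like interface described above — image observations and language instructions in, keyboard-and-mouse actions out — can be sketched as a minimal type signature. This is an illustrative sketch only; none of these class or field names come from the SIMA codebase:

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Observation:
    """What the agent perceives at each timestep."""
    frame: bytes         # raw RGB pixels of the environment's rendered screen
    instruction: str     # free-form language instruction, e.g. "open the door"


@dataclass
class Action:
    """A human-like keyboard-and-mouse action for one timestep."""
    keys_down: list[str] = field(default_factory=list)   # keyboard keys held
    mouse_dx: float = 0.0                                # relative cursor motion
    mouse_dy: float = 0.0
    mouse_buttons: list[str] = field(default_factory=list)


class Agent(Protocol):
    """Any agent exposing this interface can be dropped into a new 3D environment."""
    def act(self, obs: Observation) -> Action: ...


class NoOpAgent:
    """Trivial agent illustrating the interface: it never presses anything."""
    def act(self, obs: Observation) -> Action:
        return Action()
```

Because the interface assumes nothing about a particular environment's API, the same agent can, in principle, be run in any 3D environment that renders to the screen and accepts keyboard-and-mouse input.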