A goal of artificial intelligence is to construct an agent that can solve a wide variety of tasks. Recent progress in text-guided image synthesis has yielded models with an impressive ability to generate complex novel images, exhibiting combinatorial generalization across domains. Motivated by this success, we investigate whether such tools can be used to construct more general-purpose agents. Specifically, we cast the sequential decision making problem as a text-conditioned video generation problem, where, given a text-encoded specification of a desired goal, a planner synthesizes a set of future frames depicting its planned actions in the future, after which control actions are extracted from the generated video. By leveraging text as the underlying goal specification, we are able to naturally and combinatorially generalize to novel goals. The proposed policy-as-video formulation can further represent environments with different state and action spaces in a unified space of images, which, for example, enables learning and generalization across a variety of robot manipulation tasks. Finally, by leveraging pretrained language embeddings and widely available videos from the internet, the approach enables knowledge transfer through predicting highly realistic video plans for real robots.
翻译:人工智能的目标之一是构建能够解决广泛任务的智能体。文本引导图像合成的最新进展已产生具有生成复杂新颖图像能力的模型,这些模型展现出跨领域的组合泛化能力。受此成功启发,我们探究此类工具能否用于构建更具通用性的智能体。具体而言,我们将序列决策问题转化为文本条件视频生成问题:给定文本编码的目标规范,规划器合成一组描绘其未来计划动作的未来帧,随后从生成的视频中提取控制动作。通过利用文本作为底层目标规范,我们能够自然地以组合方式泛化到新目标。所提出的"策略即视频"框架可进一步将具有不同状态与动作空间的环境统一表示为图像空间,例如,这使得在多种机器人操作任务中实现学习与泛化成为可能。最后,通过利用预训练语言嵌入及互联网上广泛可用的视频,该方法能够为真实机器人预测高度逼真的视频规划,从而实现知识迁移。