A goal of artificial intelligence is to construct an agent that can solve a wide variety of tasks. Recent progress in text-guided image synthesis has yielded models with an impressive ability to generate complex novel images, exhibiting combinatorial generalization across domains. Motivated by this success, we investigate whether such tools can be used to construct more general-purpose agents. Specifically, we cast the sequential decision making problem as a text-conditioned video generation problem, where, given a text-encoded specification of a desired goal, a planner synthesizes a set of future frames depicting its planned actions in the future, after which control actions are extracted from the generated video. By leveraging text as the underlying goal specification, we are able to naturally and combinatorially generalize to novel goals. The proposed policy-as-video formulation can further represent environments with different state and action spaces in a unified space of images, which, for example, enables learning and generalization across a variety of robot manipulation tasks. Finally, by leveraging pretrained language embeddings and widely available videos from the internet, the approach enables knowledge transfer through predicting highly realistic video plans for real robots.
翻译:人工智能的目标是构建能够解决广泛任务的智能体。文本引导图像合成领域的最新进展,已催生出具备生成复杂新颖图像惊人能力的模型,展现出跨领域的组合泛化能力。受此成功启发,我们探索此类工具能否用于构建更具通用性的智能体。具体而言,我们将序列决策问题转化为文本条件视频生成问题:给定以文本编码形式呈现的目标规范说明,规划器合成一系列描绘其未来计划动作的后续帧,随后从生成的视频中提取控制动作。通过利用文本作为底层目标规范,我们能够自然且组合性地泛化至新颖目标。所提出的"策略即视频"框架可进一步将不同状态与动作空间的环境统一表征为图像空间,例如支持跨多种机器人操作任务的学习与泛化。最后,通过利用预训练语言嵌入及互联网上广泛可用的视频数据,该方法能够为真实机器人预测高度逼真的视频规划方案,从而实现知识迁移。