VideoPoet: A Large Language Model for Zero-Shot Video Generation

Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, Yong Cheng, Ming-Chang Chiu, Josh Dillon, Irfan Essa, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, David Ross, Grant Schindler, Mikhail Sirotenko, Kihyuk Sohn, Krishna Somandepalli, Huisheng Wang, Jimmy Yan, Ming-Hsuan Yang, Xuan Yang, Bryan Seybold, Lu Jiang

Source: arXiv. Project page: http://sites.research.google/videopoet/

We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
