We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a wide variety of conditioning signals. VideoPoet employs a decoder-only Transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted to a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motion. Project page: http://sites.research.google/videopoet/
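To make the "single sequence over multimodal tokens" idea concrete, the sketch below shows one common way a decoder-only LM can treat discrete tokens from several modalities as a single autoregressive stream. This is an illustrative sketch only, not the paper's implementation: the vocabulary ranges, sequence layout, and helper names are all hypothetical.

```python
# Illustrative sketch (assumptions, not the paper's code): each modality is
# first mapped to discrete token IDs; disjoint ID ranges keep them separable.
TEXT_RANGE = range(0, 1_000)        # hypothetical text-token IDs
VIDEO_RANGE = range(1_000, 9_000)   # hypothetical video-token IDs
AUDIO_RANGE = range(9_000, 13_000)  # hypothetical audio-token IDs
BOS = 13_000                        # hypothetical beginning-of-sequence token

def build_sequence(text_tokens, video_tokens, audio_tokens):
    """Concatenate per-modality token streams into one flat sequence.

    A decoder-only model sees no structural difference between modalities:
    everything is just a position in one autoregressive sequence.
    """
    return [BOS] + list(text_tokens) + list(video_tokens) + list(audio_tokens)

def next_token_pairs(seq):
    """The autoregressive objective: predict token t+1 from the prefix up to t."""
    return seq[:-1], seq[1:]  # (inputs, targets), shifted by one position

# Example: a short text prompt conditioning video and audio tokens.
seq = build_sequence([1, 2, 3], [1_000, 1_001], [9_000])
inputs, targets = next_token_pairs(seq)
```

Under this framing, "a mixture of multimodal generative objectives" amounts to varying which modality segments are given as context and which are predicted, while the loss stays plain next-token prediction.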