We present Emu, a Transformer-based multimodal foundation model that can seamlessly generate images and text in a multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, which together with text tokens form an interleaved input sequence. Emu is then trained end-to-end with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality enables the exploration of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, as well as web-scale image-text pairs and video-text pairs. Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks, and supports in-context image and text generation. Across a broad range of zero-shot and few-shot tasks, including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models. Extended capabilities, such as multimodal assistants obtained via instruction tuning, are also demonstrated with impressive performance.
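The unified objective described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes cross-entropy as the classification loss, mean squared error as the regression loss, and equal weighting of the two terms, none of which is specified in the abstract. The names `cross_entropy`, `squared_error` and `unified_loss` are hypothetical, and plain Python lists stand in for model outputs.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def squared_error(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def unified_loss(predictions, targets):
    """Average per-position loss over an interleaved sequence.

    targets[i] is ("text", token_id) or ("image", embedding);
    predictions[i] is the matching logits or predicted embedding.
    Assumption: both loss terms are weighted equally.
    """
    total = 0.0
    for pred, (kind, target) in zip(predictions, targets):
        if kind == "text":
            # Classify the next text token.
            total += cross_entropy(pred, target)
        elif kind == "image":
            # Regress the next visual embedding.
            total += squared_error(pred, target)
        else:
            raise ValueError(f"unknown modality: {kind}")
    return total / len(targets)


# Toy interleaved sequence: one text position, then one visual position.
targets = [("text", 0), ("image", [1.0, 0.0])]
predictions = [[0.0, 0.0], [1.0, 2.0]]
loss = unified_loss(predictions, targets)
```

In this sketch each position in the interleaved sequence contributes exactly one of the two loss terms, selected by its modality, so a single training loop can cover text-only, image-only and mixed sequences.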