We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence. Emu is then end-to-end trained with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality empowers the exploration of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, as well as web-scale image-text pairs and video-text pairs. Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks, and supports in-context image and text generation. Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models. Extended capabilities such as multimodal assistants via instruction tuning are also demonstrated with impressive performance.
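To make the unified training objective concrete, below is a minimal, illustrative sketch of a combined loss that classifies the next text token and regresses the next visual embedding over one interleaved sequence. This is not the authors' implementation; all names (`unified_autoregressive_loss`, `is_visual`, the plain L2 regression term, the 1.0 weighting) are assumptions made for illustration.

```python
# Hypothetical sketch of the unified autoregressive objective described in the
# abstract: cross-entropy on next text tokens plus regression on next visual
# embeddings, computed over the same interleaved multimodal sequence.
import torch
import torch.nn.functional as F


def unified_autoregressive_loss(text_logits, text_targets,
                                visual_pred, visual_targets,
                                is_visual, regression_weight=1.0):
    """Combine next-text-token classification and next-visual-embedding regression.

    text_logits:    (B, T, V) logits from the language-modeling head.
    text_targets:   (B, T)    next-token ids; only used at text positions.
    visual_pred:    (B, T, E) predicted next visual embeddings from a regression head.
    visual_targets: (B, T, E) target visual embeddings; only used at visual positions.
    is_visual:      (B, T)    boolean mask, True where the next element is a visual embedding.
    """
    text_mask = ~is_visual

    # Classification loss at positions whose next element is a text token.
    ce = (F.cross_entropy(text_logits[text_mask], text_targets[text_mask])
          if text_mask.any() else text_logits.new_zeros(()))

    # Regression loss (plain L2 here, as an assumption) at positions whose next
    # element is a visual embedding.
    reg = (F.mse_loss(visual_pred[is_visual], visual_targets[is_visual])
           if is_visual.any() else visual_pred.new_zeros(()))

    return ce + regression_weight * reg
```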