We present CM3Leon (pronounced "Chameleon"), a retrieval-augmented, token-based, decoder-only multi-modal language model capable of generating and infilling both text and images. CM3Leon uses the CM3 multi-modal architecture and additionally shows the strong benefits of scaling up and fine-tuning on more diverse instruction-style data. It is the first multi-modal model trained with a recipe adapted from text-only language models, consisting of a large-scale retrieval-augmented pre-training stage followed by a multi-task supervised fine-tuning (SFT) stage. It is also a general-purpose model that performs both text-to-image and image-to-text generation, which allows us to introduce self-contained contrastive decoding methods that produce high-quality outputs. Extensive experiments demonstrate that this recipe is highly effective for multi-modal models. CM3Leon achieves state-of-the-art performance in text-to-image generation (zero-shot MS-COCO FID of 4.88) with 5x less training compute than comparable methods. After SFT, CM3Leon also demonstrates unprecedented levels of controllability in tasks ranging from language-guided image editing to image-controlled generation and segmentation.