In this paper, we propose \textbf{UniCode}, a novel approach within the domain of multimodal large language models (MLLMs) that learns a unified codebook to efficiently tokenize visual, textual, and potentially other types of signals. This innovation addresses a critical limitation in existing MLLMs: their reliance on a text-only codebook, which restricts their ability to generate images and text in a multimodal context. To this end, we propose a language-driven iterative training paradigm, coupled with an in-context pre-training task we term ``image decompression'', enabling our model to interpret compressed visual data and generate high-quality images. The unified codebook empowers our model to extend visual instruction tuning to non-linguistic generation tasks. Moreover, UniCode is adaptable to diverse stacked quantization approaches, allowing it to compress visual signals into more compact token representations. Despite using significantly fewer parameters and less data during training, UniCode demonstrates promising capabilities in visual reconstruction and generation. It also achieves performance comparable to leading MLLMs across a spectrum of VQA benchmarks.