Large Language Models (LLMs), which benefit from auto-regressive modeling performed on massive unannotated text corpora, demonstrate powerful perceptual and reasoning capabilities. However, extending auto-regressive modeling to multi-modal scenarios to build Large Multi-modal Models (LMMs) faces a major difficulty: image information is processed in the LMM as continuous visual embeddings, for which no discrete supervised labels are available for classification. In this paper, we successfully perform multi-modal auto-regressive modeling with a unified objective for the first time. Specifically, we propose the concept of visual words, which maps visual features to probability distributions over the LLM's vocabulary and thereby provides supervision information for visual modeling. We further explore the distribution of visual features in the semantic space within the LMM and the possibility of using text embeddings to represent visual information. Experimental results and ablation studies on 5 VQA tasks and 4 benchmark toolkits validate the strong performance of our proposed approach.
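As a rough illustration of the visual-words idea, the sketch below maps continuous visual features to probability distributions over a vocabulary and uses them as soft targets for a classification-style loss. The abstract does not specify how the mapping is computed, so everything here is an assumption made for illustration: the dot product against a vocabulary embedding matrix followed by a softmax, the `temperature` parameter, the soft cross-entropy loss, and the names `visual_words` and `soft_cross_entropy` are all hypothetical and are not taken from the paper.

```python
import numpy as np

def visual_words(visual_feats, vocab_embeds, temperature=1.0):
    """Map visual features (n, d) to distributions over a vocabulary (n, V).

    Assumption: similarity to each vocabulary embedding is a dot product,
    normalized with a softmax. The paper may use a different mapping.
    """
    logits = visual_feats @ vocab_embeds.T / temperature
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

def soft_cross_entropy(pred_logits, target_probs):
    """Cross-entropy between predicted logits and soft target distributions."""
    shifted = pred_logits - pred_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-(target_probs * log_probs).sum(axis=-1).mean())

# Toy data standing in for real model tensors (hypothetical sizes).
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 16))     # 4 visual positions, hidden size 16
vocab = rng.normal(size=(100, 16))   # vocabulary of 100 tokens

targets = visual_words(feats, vocab)      # soft labels over the vocabulary
text_like = targets @ vocab               # visual info expressed via text embeddings
pred_logits = rng.normal(size=(4, 100))   # stand-in for the model's predictions
loss = soft_cross_entropy(pred_logits, targets)
```

Under these assumptions, each row of `targets` sums to 1 and can serve as a supervision signal at visual positions in the same way one-hot labels do at text positions, while `text_like` shows one way a visual feature could be expressed as a weighted combination of text embeddings.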