In this work, we propose Visual-Predictive Instruction Tuning (VPiT), a simple and effective extension to visual instruction tuning that enables a pretrained LLM to quickly morph into a unified autoregressive model capable of generating both text and visual tokens. VPiT teaches an LLM to predict discrete text tokens and continuous visual tokens from any input sequence of image and text data curated in an instruction-following format. Our empirical investigation reveals several intriguing properties of VPiT: (1) visual generation ability emerges as a natural byproduct of improved visual understanding, and can be unlocked efficiently with a small amount of generation data; (2) while we find understanding and generation to be mutually beneficial, understanding data contributes to both capabilities more effectively than generation data. Building upon these findings, we train our MetaMorph model and achieve competitive performance on both visual understanding and generation. In visual generation, MetaMorph can leverage the world knowledge and reasoning abilities gained from LLM pretraining, and overcome common failure modes exhibited by other generation models. Our results suggest that LLMs may have strong "prior" vision capabilities that can be efficiently adapted to both visual understanding and generation with a relatively simple instruction tuning process.
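To make the mixed-token objective concrete, below is a minimal sketch of how VPiT-style training could be set up. It assumes a decoder-only transformer whose hidden states feed two output heads: the standard language-modeling head, scored with cross-entropy on discrete text tokens, and a lightweight regression head for continuous visual tokens. The module names, the cosine-distance regression loss, and the `lambda_visual` weighting are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (assumptions labeled): joint prediction of discrete text
# tokens and continuous visual tokens from one shared transformer backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VPiTHeads(nn.Module):
    """Two heads over shared LLM hidden states (names are illustrative)."""
    def __init__(self, hidden_dim: int, vocab_size: int, visual_dim: int):
        super().__init__()
        self.text_head = nn.Linear(hidden_dim, vocab_size)    # discrete text tokens
        self.visual_head = nn.Linear(hidden_dim, visual_dim)  # continuous visual tokens

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: [batch, seq_len, hidden_dim]
        return self.text_head(hidden_states), self.visual_head(hidden_states)

def vpit_loss(text_logits, visual_preds, text_targets, visual_targets,
              is_visual_pos, lambda_visual: float = 1.0):
    """Joint loss over a mixed sequence of text and visual positions.

    is_visual_pos: bool mask [batch, seq_len]; True where the target is a
    continuous visual token, False where it is a discrete text token.
    The cosine-distance regression term is an assumed choice for the sketch.
    """
    text_mask = ~is_visual_pos
    # Cross-entropy on text positions only: [N, vocab] vs [N].
    ce = F.cross_entropy(text_logits[text_mask], text_targets[text_mask])
    # Regression on visual positions only: 1 - cosine similarity per token.
    cos = F.cosine_similarity(visual_preds[is_visual_pos],
                              visual_targets[is_visual_pos], dim=-1)
    return ce + lambda_visual * (1.0 - cos).mean()
```

In a full system the predicted continuous visual tokens would still need to be mapped back to pixels by a separate visual decoder; that component is outside the scope of this sketch.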