In this work, we investigate whether a large language model (LLM) can directly comprehend visual signals without fine-tuning on multi-modal datasets. Our method treats an image as a linguistic entity and translates it into a set of discrete words drawn from the LLM's vocabulary. To achieve this, we present the Vision-to-Language Tokenizer, abbreviated as V2L Tokenizer, which transforms an image into a "foreign language" with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model. With this image encoding, the LLM gains the ability to perform not only visual comprehension but also image denoising and restoration in an auto-regressive fashion, crucially without any fine-tuning. We validate our method through experiments covering understanding tasks such as image recognition, image captioning, and visual question answering, as well as image denoising tasks such as inpainting, outpainting, deblurring, and shift restoration. Code and models are available at https://github.com/zh460045050/V2L-Tokenizer.
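The core idea of mapping image content onto words from a fixed vocabulary can be illustrated with a nearest-neighbour lookup. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `quantize_to_vocab`, the toy embeddings, and the word list are all hypothetical, and plain cosine similarity stands in for whatever matching the encoder-decoder and CLIP model perform in the actual method.

```python
import numpy as np

def quantize_to_vocab(patch_features, vocab_embeddings):
    """Map each patch feature to the index of its most similar vocabulary embedding.

    patch_features:   (num_patches, dim) array, a stand-in for encoder output.
    vocab_embeddings: (vocab_size, dim) array, a stand-in for a frozen codebook.
    Both arrays are assumed to contain no all-zero rows.
    """
    # Normalise rows so the dot product below is cosine similarity.
    p = patch_features / np.linalg.norm(patch_features, axis=1, keepdims=True)
    v = vocab_embeddings / np.linalg.norm(vocab_embeddings, axis=1, keepdims=True)
    similarity = p @ v.T  # shape: (num_patches, vocab_size)
    return similarity.argmax(axis=1)

# Hypothetical 5-word vocabulary embedded in 4 dimensions.
vocab_words = ["sky", "dog", "grass", "ball", "tree"]
vocab = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
])

# Hypothetical features for four image patches.
patches = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.8, 0.1, 0.0],
    [0.7, 0.6, 0.0, 0.0],
    [0.0, 0.0, 0.1, 2.0],
])

token_ids = quantize_to_vocab(patches, vocab)
tokens = [vocab_words[i] for i in token_ids]
print(tokens)
```

Because the codebook is fixed, the resulting token sequence consists only of entries the language model already knows, which is what would allow a frozen LLM to consume it without fine-tuning.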