Building on the recent remarkable development of large language models (LLMs), active attempts are being made to extend their utility to multimodal tasks. Previous efforts have linked language and visual information, and attempts to add visual capabilities to LLMs are ongoing. However, existing attempts use LLMs only as image decoders, and no attempt has been made to generate images in the same manner as natural language. By adopting a VQ-GAN framework in which the latent representations of images are treated as a kind of text token, we present a novel method to fine-tune a pre-trained LLM to read and generate images like text without any structural changes, extra training objectives, or the need to train an ad-hoc network, while still preserving the instruction-following capability of the LLM. We apply this framework to chest X-ray (CXR) image and report generation tasks, a domain in which translating complex information between the visual and language domains is important. The code will soon be made publicly available.
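As a rough illustration of treating VQ-GAN latent codes as text tokens, the minimal sketch below shifts image codebook indices into an id range placed after the text vocabulary, so that one sequence can hold both a prompt and an image. This is not the released implementation: the vocabulary size, the codebook size, and all function names are hypothetical, chosen only to make the idea concrete.

```python
from typing import List

# Hypothetical sizes, chosen only for illustration.
TEXT_VOCAB_SIZE = 50_000  # assumed size of the LLM's text vocabulary
CODEBOOK_SIZE = 1_024     # assumed number of VQ-GAN codebook entries


def image_codes_to_token_ids(codes: List[int]) -> List[int]:
    """Shift VQ-GAN codebook indices into the id range after the text vocabulary."""
    for code in codes:
        if not 0 <= code < CODEBOOK_SIZE:
            raise ValueError(f"codebook index out of range: {code}")
    return [TEXT_VOCAB_SIZE + code for code in codes]


def token_ids_to_image_codes(token_ids: List[int]) -> List[int]:
    """Recover codebook indices from image-range ids, ignoring text-range ids."""
    upper = TEXT_VOCAB_SIZE + CODEBOOK_SIZE
    return [t - TEXT_VOCAB_SIZE for t in token_ids if TEXT_VOCAB_SIZE <= t < upper]


def build_sequence(prompt_ids: List[int], image_codes: List[int]) -> List[int]:
    """Concatenate text-token ids and shifted image-token ids into one sequence."""
    return list(prompt_ids) + image_codes_to_token_ids(image_codes)
```

Under these assumptions, a fine-tuned model would see image tokens as additional vocabulary entries, and any image-range ids it emits could be mapped back to codebook indices and passed to a VQ-GAN decoder.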