LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation

Following the rapid development of large language models (LLMs), vision-language alignment in LLMs is being actively researched to enable multimodal reasoning and visual input/output. This direction is particularly relevant to medical imaging because medical image analysis and generation require reasoning over a combination of visual features and prior knowledge. Many recent works train adapter networks that serve as an information bridge between image processing networks and LLMs. However, to realize the full reasoning potential of LLMs on visual information, visual and language features should presumably be allowed to interact more freely. This is especially important in the medical domain because understanding and generating medical images such as chest X-rays (CXR) requires not only accurate visual and language-based reasoning but also a closer mapping between the two modalities. Taking inspiration from previous work that combines a transformer with VQ-GAN for bidirectional image and text generation, we develop a method for instruction-tuning an LLM pretrained only on text so that it gains vision-language capabilities for medical images. Specifically, we leverage the pretrained LLM's existing question-answering and instruction-following abilities: we teach it to understand visual inputs by instructing it to answer questions about image inputs and, symmetrically, to output text and image responses appropriate to a given query, by tuning it on diverse tasks that encompass image-based text generation and text-based image generation. We show that our model, LLM-CXR, trained with this approach, achieves better image-text alignment in both CXR understanding and generation tasks while being smaller than previously developed models that perform a narrower range of tasks. The code is available at https://github.com/hyn2028/llm-cxr.
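As a rough illustration of how a text-only LLM might consume and emit images through a VQ-GAN, the sketch below places image codebook indices in the same id space as text tokens by shifting them past the text vocabulary. It is not taken from the LLM-CXR code: the vocabulary size, the codebook size, and the function names `image_codes_to_token_ids`, `token_ids_to_image_codes`, and `build_sequence` are all hypothetical, and the ordering of instruction, image, and response is only one plausible layout.

```python
from typing import List

# Assumed values for illustration only; the real sizes depend on the
# chosen LLM tokenizer and VQ-GAN and are not stated in the abstract.
TEXT_VOCAB_SIZE = 50_000     # hypothetical text vocabulary size
IMAGE_CODEBOOK_SIZE = 1_024  # hypothetical VQ-GAN codebook size


def image_codes_to_token_ids(codes: List[int]) -> List[int]:
    """Shift VQ-GAN codebook indices past the text vocabulary."""
    for code in codes:
        if not 0 <= code < IMAGE_CODEBOOK_SIZE:
            raise ValueError(f"codebook index out of range: {code}")
    return [TEXT_VOCAB_SIZE + code for code in codes]


def token_ids_to_image_codes(token_ids: List[int]) -> List[int]:
    """Keep only image-token ids and map them back to codebook indices."""
    return [t - TEXT_VOCAB_SIZE for t in token_ids if t >= TEXT_VOCAB_SIZE]


def build_sequence(
    instruction_ids: List[int],
    image_codes: List[int],
    response_ids: List[int],
) -> List[int]:
    """Concatenate instruction, image tokens, and response into one sequence."""
    return instruction_ids + image_codes_to_token_ids(image_codes) + response_ids


if __name__ == "__main__":
    # Toy ids standing in for a tokenized instruction, an encoded CXR, and a report.
    sequence = build_sequence([1, 2], [0, 5], [3])
    print(sequence)
    print(token_ids_to_image_codes(sequence))
```

With a shared id space of this kind, image-to-text and text-to-image tasks differ only in which side of the sequence holds the image tokens, which is one way the symmetric instruction-tuning described above could be set up.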
