Following the impressive development of LLMs, vision-language alignment in LLMs is actively being researched to enable multimodal reasoning and visual input and output. This direction of research is particularly relevant to medical imaging because medical image analysis and generation involve reasoning over a combination of visual features and prior knowledge. Many recent works have focused on training adapter networks that serve as an information bridge between image processing networks and LLMs; but to realize the full reasoning potential of LLMs on visual information as well, visual and language features should arguably be allowed to interact more freely. This is especially important in the medical domain because understanding and generating medical images such as chest X-rays (CXR) requires not only accurate visual and language-based reasoning but also a more intimate mapping between the two modalities. Thus, taking inspiration from previous work that combined a transformer with a VQ-GAN for bidirectional image and text generation, we build upon this approach and develop a method for instruction-tuning an LLM pre-trained only on text to gain vision-language capabilities for medical images. Specifically, we leverage a pretrained LLM's existing question-answering and instruction-following abilities to teach it to understand visual inputs by instructing it to answer questions about images and, symmetrically, to output both text and image responses appropriate to a given query by tuning the LLM with diverse tasks that encompass image-based text generation and text-based image generation. We show that our model, LLM-CXR, trained with this approach exhibits better image-text alignment in both CXR understanding and generation tasks while being smaller than previously developed models that perform a narrower range of tasks. The code is at https://github.com/hyn2028/llm-cxr.
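To make the bidirectional setup concrete, the sketch below illustrates one way the described pipeline could be wired: a VQ-GAN encodes a CXR into discrete code indices, those indices are mapped to special tokens added to the LLM's vocabulary, and training examples interleave image tokens with instruction text in both directions. This is a minimal illustration under stated assumptions, not the authors' exact implementation: `vqgan_encode`, the `<img_{k}>` token template, the codebook size, and the prompt format are all hypothetical placeholders.

```python
# Minimal sketch (assumptions labeled): interleaving VQ-GAN image tokens with
# instruction text for bidirectional CXR understanding and generation.
# `vqgan_encode`, `<img_{k}>`, CODEBOOK_SIZE, N_IMG_TOKENS, and the prompt
# templates are illustrative, not taken from the LLM-CXR codebase.

from typing import List

CODEBOOK_SIZE = 1024   # assumed VQ-GAN codebook size
N_IMG_TOKENS = 256     # assumed latent grid, e.g. 16x16 codes per image


def vqgan_encode(image) -> List[int]:
    """Placeholder: a real VQ-GAN encoder maps an image to discrete code indices."""
    return [0] * N_IMG_TOKENS  # dummy codes for illustration only


def image_to_tokens(image) -> str:
    """Map VQ-GAN code indices to special tokens added to the LLM vocabulary."""
    codes = vqgan_encode(image)
    return "".join(f"<img_{c}>" for c in codes)


def build_cxr_to_report_example(image, report: str) -> str:
    """Image-based text generation: the LLM reads image tokens, answers in text."""
    return (
        "User: Describe the findings in this chest X-ray.\n"
        f"{image_to_tokens(image)}\n"
        f"Assistant: {report}"
    )


def build_report_to_cxr_example(report: str, image) -> str:
    """Text-based image generation: the LLM emits image tokens, which a
    VQ-GAN decoder would later map back to pixels."""
    return (
        f"User: Generate a chest X-ray showing: {report}\n"
        f"Assistant: {image_to_tokens(image)}"
    )


if __name__ == "__main__":
    ex = build_cxr_to_report_example(None, "No acute cardiopulmonary process.")
    print(ex[:120])
```

Because both task directions are expressed as ordinary next-token sequences over a shared text-plus-image-token vocabulary, a single instruction-tuned LLM can be trained on a mixture of both without any architectural change beyond the enlarged embedding table.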