The latest breakthroughs in large vision-language models, such as Bard and GPT-4, have showcased extraordinary abilities in performing a wide range of tasks. Such models are trained on massive datasets comprising billions of public image-text pairs with diverse tasks. However, their performance on task-specific domains, such as radiology, is still under-investigated and potentially limited due to a lack of sophistication in understanding biomedical images. On the other hand, conversational medical models have exhibited remarkable success but have mainly focused on text-based analysis. In this paper, we introduce XrayGPT, a novel conversational medical vision-language model that can analyze and answer open-ended questions about chest radiographs. Specifically, we align a medical visual encoder (MedClip) with a fine-tuned large language model (Vicuna) using a simple linear transformation. This alignment enables our model to possess exceptional visual conversation abilities, grounded in a deep understanding of radiographs and medical domain knowledge. To enhance the performance of LLMs in the medical context, we generate ~217k interactive and high-quality summaries from free-text radiology reports and use them to fine-tune the LLM. Our approach opens up new research avenues for advancing the automated analysis of chest radiographs. Our open-source demos, models, and instruction sets are available at: https://github.com/mbzuai-oryx/XrayGPT.
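As a rough illustration of what aligning a visual encoder with a language model through a simple linear transformation can look like, the sketch below projects a handful of visual tokens into a wider embedding space. It is not the authors' implementation: the function names (`make_linear`, `project`), the dimensions, the random weights, and the toy input values are all assumptions chosen for illustration, and in a real system the tokens would come from the visual encoder and the weights would be learned during training.

```python
import random


def make_linear(d_in, d_out, seed=0):
    """Create a randomly initialised weight matrix (d_out x d_in) and a zero bias."""
    rng = random.Random(seed)
    scale = d_in ** -0.5  # keep projected values at a similar scale to the inputs
    weight = [[rng.gauss(0.0, scale) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out
    return weight, bias


def project(visual_tokens, weight, bias):
    """Map each visual token (length d_in) to the language model width (length d_out)."""
    return [
        [sum(w * x for w, x in zip(row, token)) + b for row, b in zip(weight, bias)]
        for token in visual_tokens
    ]


# Hypothetical sizes: 4 visual tokens of width 8, language model embedding width 16.
D_VISUAL, D_LLM, N_TOKENS = 8, 16, 4

weight, bias = make_linear(D_VISUAL, D_LLM)

# Toy stand-in for visual encoder outputs (assumed values, not real image features).
visual_tokens = [[0.1 * (i + j) for j in range(D_VISUAL)] for i in range(N_TOKENS)]

# The projected tokens have the language model's embedding width and could be
# placed alongside text token embeddings as input to the language model.
llm_inputs = project(visual_tokens, weight, bias)
print(len(llm_inputs), len(llm_inputs[0]))  # prints "4 16"
```

The only trainable piece in this sketch is the single weight matrix and bias, which is what makes a linear alignment layer cheap to fit compared with retraining either the encoder or the language model.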