Large multimodal language models (LMMs) have achieved significant success in general domains. However, because medical images and text differ substantially from general web content, the performance of LMMs in medical scenarios remains limited. In ophthalmology, clinical diagnosis relies on multiple modalities of medical images, yet multimodal large language models for ophthalmology have not been explored to date. In this paper, we study and construct an ophthalmic large multimodal model. First, using fundus images as an entry point, we build a disease assessment and diagnosis pipeline that performs diagnosis of common ophthalmic diseases and lesion segmentation. Then, we establish a new ophthalmic multimodal instruction-following and dialogue fine-tuning dataset based on disease-related knowledge data and publicly available real-world medical dialogue. We introduce visual ability into the large language model to complete the ophthalmic large language and vision assistant (OphGLM). Our experimental results demonstrate that OphGLM performs exceptionally well and has the potential to revolutionize clinical applications in ophthalmology. The dataset, code, and models will be made publicly available at https://github.com/ML-AILab/OphGLM.