With the rapid development of multimodal large language models (MLLMs), and especially of their referring and grounding capabilities in visual chat, their significance is increasingly recognized. However, the biomedical field currently exhibits a substantial gap in this area, primarily due to the absence of a dedicated refer-and-ground dataset for biomedical images. To address this challenge, we devised the Med-GRIT-270k dataset. It comprises 270k question-and-answer pairs and spans eight distinct medical imaging modalities. Most importantly, it is the first dataset dedicated to the biomedical domain that integrates refer-and-ground conversations. The key idea is to sample large-scale biomedical image-mask pairs from medical segmentation datasets and to generate instruction data from the accompanying text using ChatGPT. Additionally, we introduce a Refer-and-Ground Multimodal Large Language Model for Biomedicine (BiRD), trained on this dataset with multi-task instruction learning. Extensive experiments corroborate the efficacy of the Med-GRIT-270k dataset and the multimodal, fine-grained interactive capabilities of the BiRD model. This work provides a valuable reference for the exploration and development of intelligent biomedical assistants.
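The abstract does not spell out how an image-mask pair becomes a refer-and-ground instruction, so the following is a minimal sketch of one plausible step in such a pipeline, not the authors' actual method: deriving a bounding box from a segmentation mask and composing a text-only prompt for ChatGPT. The helper names (mask_to_bbox, build_grounding_prompt), the coordinate normalization to 1000, and the prompt wording are all assumptions for illustration.

```python
import numpy as np


def mask_to_bbox(mask: np.ndarray) -> tuple[int, int, int, int]:
    """Convert a binary segmentation mask of shape (H, W) into a tight
    bounding box (x_min, y_min, x_max, y_max) in pixel coordinates."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


def build_grounding_prompt(organ: str, modality: str,
                           bbox: tuple[int, int, int, int],
                           image_size: tuple[int, int]) -> str:
    """Compose a text-only prompt asking ChatGPT to write one refer-and-ground
    QA pair; coordinates are normalized to [0, 1000), a common MLLM convention
    (an assumption here, not something stated in the abstract)."""
    h, w = image_size
    x0, y0, x1, y1 = bbox
    norm = (int(1000 * x0 / w), int(1000 * y0 / h),
            int(1000 * x1 / w), int(1000 * y1 / h))
    return (
        f"You are given a {modality} image containing a {organ} located at "
        f"box {norm} (coordinates normalized to 1000). Write one question "
        f"that refers to this region and one answer that grounds the "
        f"{organ} with its bounding box."
    )


# Toy example: a 256x256 mask with a single rectangular foreground region.
mask = np.zeros((256, 256), dtype=np.uint8)
mask[60:120, 80:200] = 1
bbox = mask_to_bbox(mask)
prompt = build_grounding_prompt("liver", "CT", bbox, mask.shape)
print(prompt)  # this prompt would then be sent to ChatGPT to obtain a QA pair
```

Under this sketch, the returned QA pair would be paired with the source image to form one grounded instruction sample; the paper's actual prompting and filtering details are not specified in the abstract.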