The growing interest in developing Artificial Intelligence applications for the medical domain is hampered by the lack of high-quality datasets, mainly due to privacy-related issues. Moreover, the recent rise of Multimodal Large Language Models (MLLMs) has created a need for multimodal medical datasets in which clinical reports and findings are attached to the corresponding CT or MR scans. This paper illustrates the entire workflow for building the MedPix 2.0 dataset. Starting from the well-known multimodal dataset MedPix\textsuperscript{\textregistered}, mainly used by physicians, nurses, and healthcare students for Continuing Medical Education purposes, we developed a semi-automatic pipeline to extract visual and textual data, followed by a manual curation procedure in which noisy samples were removed, and stored the result in a MongoDB database. Along with the dataset, we developed a GUI for efficiently navigating the MongoDB instance and obtaining the raw data, which can be easily used for training and/or fine-tuning MLLMs. To reinforce this point, we also propose a CLIP-based model trained on MedPix 2.0 for scan classification tasks.