Developing generalist foundation models has recently attracted tremendous attention among researchers in the field of AI for Medicine (AI4Medicine). A pivotal insight in developing these models is their reliance on dataset scaling, which underscores the need for open-source medical image datasets that incorporate diverse supervision signals across various imaging modalities. In this paper, we introduce RadGenome-Chest CT, a comprehensive, large-scale, region-guided 3D chest CT interpretation dataset based on CT-RATE. Specifically, we leverage the latest powerful universal segmentation and large language models to extend the original dataset (25,692 non-contrast 3D chest CT volumes and reports from 20,000 patients) in the following aspects: (i) organ-level segmentation masks covering 197 categories, which provide intermediate visual clues for interpretation reasoning; (ii) 665K multi-granularity grounded reports, in which each sentence of the report is linked to the corresponding anatomical region of the CT volume in the form of a segmentation mask; (iii) 1.3M grounded VQA pairs, in which questions and answers are all linked with reference segmentation masks, enabling models to associate visual evidence with textual explanations. All grounded reports and VQA pairs in the validation set have undergone manual verification to ensure dataset quality. We believe that RadGenome-Chest CT can significantly advance the development of multimodal medical foundation models by supporting training to generate text grounded in given segmentation regions, a capability unattainable with previous relevant datasets. We will release all segmentation masks, grounded reports, and VQA pairs to facilitate further research and development in this field.
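To make the three kinds of supervision concrete, the sketch below shows how one study's grounded annotations could be organized: each report sentence and each VQA pair carries a pointer to the segmentation mask of its anatomical region. All field names (`volume_id`, `region`, `mask_path`, etc.) are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class GroundedSentence:
    sentence: str   # one sentence from the radiology report
    region: str     # anatomical region name (one of the 197 categories)
    mask_path: str  # path to the matching segmentation mask (hypothetical layout)

@dataclass
class GroundedVQAPair:
    question: str
    answer: str
    mask_path: str  # reference mask grounding both question and answer

@dataclass
class CTStudy:
    volume_id: str
    sentences: list = field(default_factory=list)  # GroundedSentence items
    vqa_pairs: list = field(default_factory=list)  # GroundedVQAPair items

# Example: link one report sentence to its region mask.
study = CTStudy(volume_id="train_00001")
study.sentences.append(GroundedSentence(
    sentence="No focal lesion is seen in the right lung.",
    region="right lung",
    mask_path="masks/train_00001/right_lung.nii.gz",
))
print(len(study.sentences))  # -> 1
```

A record like this lets a model be trained to generate the sentence (or answer) conditioned on the volume plus the referenced mask, which is the region-guided setting the abstract describes.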