GenerateCT: Text-Conditional Generation of 3D Chest CT Volumes

Ibrahim Ethem Hamamci, Sezgin Er, Anjany Sekuboyina, Enis Simsar, Alperen Tezcan, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Furkan Almas, Irem Dogan, Muhammed Furkan Dasdelen, Chinmay Prabhakar, Hadrien Reynaud, Sarthak Pati, Christian Bluethgen, Mehmet Kemal Ozdemir, Bjoern Menze

GenerateCT, the first approach to generating 3D medical imaging conditioned on free-form medical text prompts, incorporates a text encoder and three key components: a novel causal vision transformer for encoding 3D CT volumes, a text-image transformer for aligning CT and text tokens, and a text-conditional super-resolution diffusion model. Without directly comparable methods in 3D medical imaging, we benchmarked GenerateCT against cutting-edge methods, demonstrating its superiority across all key metrics. Importantly, we evaluated GenerateCT's clinical applications in a multi-abnormality classification task. First, we established a baseline by training a multi-abnormality classifier on our real dataset. To further assess the model's generalization to external data and performance with unseen prompts in a zero-shot scenario, we employed an external set to train the classifier, setting an additional benchmark. We conducted two experiments in which we doubled the training datasets by synthesizing an equal number of volumes for each set using GenerateCT. The first experiment demonstrated an 11% improvement in the AP score when training the classifier jointly on real and generated volumes. The second experiment showed a 7% improvement when training on both real and generated volumes based on unseen prompts. Moreover, GenerateCT enables the scaling of synthetic training datasets to arbitrary sizes. As an example, we generated 100,000 3D CTs, fivefold the number in our real set, and trained the classifier exclusively on these synthetic CTs. Impressively, this classifier surpassed the performance of the one trained on all available real data by a margin of 8%. Last, domain experts evaluated the generated volumes, confirming a high degree of alignment with the text prompt. Access our code, model weights, training data, and generated data at https://github.com/ibrahimethemhamamci/GenerateCT
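As a rough illustration of how the components named in the abstract might fit together at inference time, the following Python sketch traces the data flow from prompt to volume. It is not the authors' implementation: every function name, shape, and the order of composition are assumptions made for illustration, and each stage is a trivial NumPy stand-in rather than a trained network.

```python
import numpy as np

def encode_text(prompt: str, dim: int = 16) -> np.ndarray:
    """Hypothetical stand-in for the text encoder: prompt -> embedding."""
    seed = sum(prompt.encode("utf-8"))  # deterministic seed derived from the prompt
    return np.random.default_rng(seed).standard_normal(dim)

def predict_ct_tokens(text_emb: np.ndarray, grid=(4, 8, 8), vocab: int = 1024) -> np.ndarray:
    """Hypothetical stand-in for the text-image transformer:
    emits a (depth, height, width) grid of discrete CT token ids."""
    seed = int(np.abs(text_emb).sum() * 1e6) % (2**32)
    return np.random.default_rng(seed).integers(0, vocab, size=grid)

def decode_tokens(tokens: np.ndarray, vocab: int = 1024, patch: int = 4) -> np.ndarray:
    """Hypothetical stand-in for the decoder of the causal vision transformer:
    maps token ids to a low-resolution volume with intensities in [0, 1]."""
    low = tokens.astype(np.float32) / (vocab - 1)
    return low.repeat(patch, axis=1).repeat(patch, axis=2)

def super_resolve(volume: np.ndarray, text_emb: np.ndarray, factor: int = 4) -> np.ndarray:
    """Hypothetical stand-in for the text-conditional super-resolution model.
    A real model would condition on text_emb; this stub only upsamples
    each slice in-plane by nearest-neighbour repetition."""
    return volume.repeat(factor, axis=1).repeat(factor, axis=2)

def generate_ct(prompt: str) -> np.ndarray:
    """Assumed composition: text -> tokens -> low-res volume -> high-res volume."""
    text_emb = encode_text(prompt)
    tokens = predict_ct_tokens(text_emb)
    low_res = decode_tokens(tokens)
    return super_resolve(low_res, text_emb)

if __name__ == "__main__":
    vol = generate_ct("Bilateral pleural effusion.")
    print(vol.shape)
```

The stubs only make the interfaces concrete: discrete tokens sit between the text-conditioned transformer and the volume decoder, and super-resolution is applied as a separate final stage. The actual architecture and resolutions are documented in the linked repository.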
