Multi-modal large language models (MLLMs) have been widely explored for medical applications, with a primary focus on radiology report generation. However, preliminary success in 2D radiology captioning does not reflect the real-world diagnostic challenge posed by volumetric 3D anatomy. To address three crucial limitations in the existing literature, namely (1) data complexity, (2) model capacity, and (3) evaluation metric fidelity, we collected 3D-BrainCT, a dataset of 18,885 text-scan pairs, and applied clinical visual instruction tuning (CVIT) to train BrainGPT models that generate radiology-adherent 3D brain CT reports. Statistically, our BrainGPT scored BLEU-1 = 44.35, BLEU-4 = 20.38, METEOR = 30.13, ROUGE-L = 47.6, and CIDEr-R = 211.77 on internal testing, and achieved an accuracy of 0.91 in captioning midline shift on the external CQ500 validation dataset. On further inspection of the captioned reports, we found that traditional metrics measure only surface text similarity and fail to gauge the information density relevant to the diagnostic purpose. To close this gap, we propose Feature-Oriented Radiology Task Evaluation (FORTE), a novel metric that estimates a report's clinical relevance (lesion features and landmarks). Notably, the BrainGPT model scored an average FORTE F1-score of 0.71 (degree = 0.661; landmark = 0.706; feature = 0.693; impression = 0.779). To demonstrate that BrainGPT models are objectively ready to generate human-like radiology reports, we conducted a Turing test with 11 physician evaluators; around 74% of BrainGPT-generated captions were indistinguishable from those written by humans. Our work embodies a holistic framework, offering first-hand experience in curating a 3D brain CT dataset, fine-tuning anatomy-sensible language models, and proposing robust radiology evaluation metrics.
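To make the contrast with surface-similarity metrics concrete, the core idea of a feature-oriented score can be sketched as a per-category keyword-overlap F1. This is a minimal illustration in the spirit of FORTE, not the paper's actual implementation: the four categories (degree, landmark, feature, impression) come from the abstract, but the keyword vocabularies, tokenization, and matching rules below are all assumptions.

```python
# Hypothetical sketch of a keyword-overlap F1 in the spirit of FORTE.
# The category names follow the abstract; the keyword lists and the
# simple whitespace tokenization are illustrative assumptions.

def category_f1(reference_keywords, candidate_keywords):
    """F1 between the keyword sets extracted from two reports."""
    ref, cand = set(reference_keywords), set(candidate_keywords)
    tp = len(ref & cand)  # keywords mentioned in both reports
    if tp == 0:
        return 0.0
    precision = tp / len(cand)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)

def forte_like_score(reference, candidate, keyword_bank):
    """Average F1 across clinical keyword categories.

    keyword_bank maps each category (e.g. "degree", "landmark") to its
    vocabulary; each report is reduced to the vocabulary words it
    contains, and the overlap is scored category by category.
    """
    ref_words = set(reference.lower().split())
    cand_words = set(candidate.lower().split())
    scores = {}
    for cat, vocab in keyword_bank.items():
        scores[cat] = category_f1(ref_words & vocab, cand_words & vocab)
    avg = sum(scores.values()) / len(scores)
    return avg, scores
```

Unlike n-gram metrics such as BLEU, this kind of score is unaffected by paraphrasing of non-clinical filler text: only whether the diagnostically salient terms appear is counted.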