Multi-modal large language models (MLLMs) have been widely explored for medical applications, with a primary focus on radiology report generation. However, preliminary success in 2D radiology captioning does not reflect the real-world diagnostic challenge posed by volumetric 3D anatomy. To address three crucial limitations in the existing literature, namely (1) data complexity, (2) model capacity, and (3) evaluation metric fidelity, we collected 3D-BrainCT, a dataset of 18,885 text-scan pairs, and applied clinical visual instruction tuning (CVIT) to train BrainGPT models that generate radiology-adherent 3D brain CT reports. Statistically, our BrainGPT scored BLEU-1 = 44.35, BLEU-4 = 20.38, METEOR = 30.13, ROUGE-L = 47.6, and CIDEr-R = 211.77 on internal testing, and achieved an accuracy of 0.91 in captioning midline shift on the external CQ500 validation dataset. On further inspection of the captioned reports, we found that traditional metrics measure only surface text similarity and fail to gauge the information density relevant to the diagnostic purpose. To close this gap, we propose Feature-Oriented Radiology Task Evaluation (FORTE), a novel metric that estimates a report's clinical relevance (lesion features and landmarks). Notably, the BrainGPT model scored an average FORTE F1-score of 0.71 (degree = 0.661; landmark = 0.706; feature = 0.693; impression = 0.779). To demonstrate that BrainGPT models are objectively ready to generate human-like radiology reports, we conducted a Turing test with 11 physician evaluators; around 74% of BrainGPT-generated captions were indistinguishable from those written by humans. Our work embodies a holistic framework, offering first-hand experience in curating a 3D brain CT dataset, fine-tuning anatomy-sensible language models, and proposing robust radiology evaluation metrics.
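To make the contrast with surface-similarity metrics concrete, the core idea of a feature-oriented score can be sketched as a per-category keyword-overlap F1. This is a minimal illustration in the spirit of FORTE, not the paper's actual implementation: the four categories (degree, landmark, feature, impression) come from the abstract, but the keyword vocabularies, tokenization, and matching rules below are all assumptions.

```python
# Hypothetical sketch of a keyword-overlap F1 in the spirit of FORTE.
# The category names follow the abstract; the keyword lists and the
# simple whitespace tokenization are illustrative assumptions.

def category_f1(reference_keywords, candidate_keywords):
    """F1 between the keyword sets extracted from two reports."""
    ref, cand = set(reference_keywords), set(candidate_keywords)
    tp = len(ref & cand)  # keywords mentioned in both reports
    if tp == 0:
        return 0.0
    precision = tp / len(cand)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)

def forte_like_score(reference, candidate, keyword_bank):
    """Average F1 across clinical keyword categories.

    keyword_bank maps each category (e.g. "degree", "landmark") to its
    vocabulary; each report is reduced to the vocabulary words it
    contains, and the overlap is scored category by category.
    """
    ref_words = set(reference.lower().split())
    cand_words = set(candidate.lower().split())
    scores = {}
    for cat, vocab in keyword_bank.items():
        scores[cat] = category_f1(ref_words & vocab, cand_words & vocab)
    avg = sum(scores.values()) / len(scores)
    return avg, scores
```

Unlike n-gram metrics such as BLEU, this kind of score is unaffected by paraphrasing of non-clinical filler text: only whether the diagnostically salient terms appear is counted.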