Automatic Personalized Impression Generation for PET Reports Using Large Language Models

Xin Tie, Muheon Shin, Ali Pirasteh, Nevein Ibrahim, Zachary Huemann, Sharon M. Castellino, Kara M. Kelly, John Garrett, Junjie Hu, Steve Y. Cho, Tyler J. Bradshaw

From arXiv. 18 pages for the main body, 13 pages for the appendix; 6 figures and 3 tables in the main body. This manuscript has been submitted to Radiology: Artificial Intelligence.

Purpose: To determine whether fine-tuned large language models (LLMs) can generate accurate, personalized impressions for whole-body PET reports.

Materials and Methods: Twelve language models were trained on a corpus of PET reports using the teacher-forcing algorithm, with the report findings as input and the clinical impressions as reference. An extra input token encoded the reading physician's identity, allowing the models to learn physician-specific reporting styles. The corpus comprised 37,370 retrospective PET reports collected from our institution between 2010 and 2022. To identify the best LLM, 30 evaluation metrics were benchmarked against quality scores from two nuclear medicine (NM) physicians, and the metrics most aligned with those scores were used to select the model for expert evaluation. In a subset of the data, model-generated impressions and original clinical impressions were assessed by three NM physicians according to 6 quality dimensions and an overall utility score (5-point scale). Each physician reviewed 12 of their own reports and 12 reports from other physicians. Bootstrap resampling was used for statistical analysis.

Results: Of all evaluation metrics, domain-adapted BARTScore and PEGASUSScore showed the highest Spearman's rho correlations (0.568 and 0.563, respectively) with physician preferences. Based on these metrics, the fine-tuned PEGASUS model was selected as the top LLM. When physicians reviewed PEGASUS-generated impressions in their own style, 89% were considered clinically acceptable, with a mean utility score of 4.08/5. Physicians rated these personalized impressions as comparable in overall utility to the impressions dictated by other physicians (4.03, P=0.41).

Conclusion: Personalized impressions generated by PEGASUS were clinically useful, highlighting its potential to expedite PET reporting.
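The metric-selection step in the methods (ranking evaluation metrics by their Spearman's rho with physician quality scores) can be illustrated with a minimal sketch. This is not the authors' code: the names `metric_a` and `metric_b`, the helper functions, and all scores below are hypothetical toy values chosen only to show the ranking logic, and tie handling by average ranks is an assumption.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data: one physician quality score per generated impression,
# and the score each candidate metric assigned to the same impressions.
physician_scores = [5, 3, 4, 2, 1, 4]
metric_scores = {
    "metric_a": [0.9, 0.5, 0.7, 0.4, 0.1, 0.8],
    "metric_b": [0.2, 0.9, 0.3, 0.8, 0.7, 0.1],
}

# Rank the candidate metrics by agreement with the physician scores.
rho_by_metric = {
    name: spearman_rho(scores, physician_scores)
    for name, scores in metric_scores.items()
}
best_metric = max(rho_by_metric, key=rho_by_metric.get)
print(best_metric, round(rho_by_metric[best_metric], 3))
```

In this toy setup the metric with the highest rho would be the one used to pick a model for expert review, mirroring how the abstract describes BARTScore and PEGASUSScore being chosen.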
