Automatic Personalized Impression Generation for PET Reports Using Large Language Models

Xin Tie, Muheon Shin, Ali Pirasteh, Nevein Ibrahim, Zachary Huemann, Sharon M. Castellino, Kara M. Kelly, John Garrett, Junjie Hu, Steve Y. Cho, Tyler J. Bradshaw

Source: arXiv. 25 pages in total, with 6 figures and 3 tables in the main body. The manuscript has been submitted to a journal for potential publication.

In this study, we aimed to determine if fine-tuned large language models (LLMs) can generate accurate, personalized impressions for whole-body PET reports. Twelve language models were trained on a corpus of PET reports using the teacher-forcing algorithm, with the report findings as input and the clinical impressions as reference. An extra input token encodes the reading physician's identity, allowing models to learn physician-specific reporting styles. Our corpus comprised 37,370 retrospective PET reports collected from our institution between 2010 and 2022. To identify the best LLM, 30 evaluation metrics were benchmarked against quality scores from two nuclear medicine (NM) physicians, with the most aligned metrics selecting the model for expert evaluation. In a subset of data, model-generated impressions and original clinical impressions were assessed by three NM physicians according to 6 quality dimensions (3-point scale) and an overall utility score (5-point scale). Each physician reviewed 12 of their own reports and 12 reports from other physicians. Bootstrap resampling was used for statistical analysis. Of all evaluation metrics, domain-adapted BARTScore and PEGASUSScore showed the highest Spearman's rank correlations (0.568 and 0.563) with physician preferences. Based on these metrics, the fine-tuned PEGASUS model was selected as the top LLM. When physicians reviewed PEGASUS-generated impressions in their own style, 89% were considered clinically acceptable, with a mean utility score of 4.08 out of 5. Physicians rated these personalized impressions as comparable in overall utility to the impressions dictated by other physicians (4.03, P=0.41). In conclusion, personalized impressions generated by PEGASUS were clinically useful, highlighting its potential to expedite PET reporting.
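The two statistical steps named in the abstract, ranking evaluation metrics by their Spearman correlation with physician scores and comparing utility scores with bootstrap resampling, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`average_ranks`, `spearman`, `bootstrap_mean_diff`), the percentile-interval form of the bootstrap, the resample count, and the toy numbers are all hypothetical, since the abstract does not specify these details.

```python
import random
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def bootstrap_mean_diff(a, b, n_resamples=10_000, seed=0):
    """Observed mean(a) - mean(b) and a percentile 95% interval.

    Assumption: each group is resampled independently with replacement;
    the study's exact bootstrap design is not described in the abstract.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(mean(resampled_a) - mean(resampled_b))
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return mean(a) - mean(b), (lower, upper)


if __name__ == "__main__":
    # Invented toy values for illustration only; these are not study data.
    physician_scores = [2, 4, 3, 5, 1]
    metric_scores = [0.41, 0.62, 0.58, 0.70, 0.35]
    print("rho:", spearman(metric_scores, physician_scores))

    own_style = [4, 5, 4, 3, 4, 5]
    other_physicians = [4, 4, 5, 3, 4, 4]
    print(bootstrap_mean_diff(own_style, other_physicians, n_resamples=2000))
```

In a metric-selection setting, `spearman` would be computed once per candidate metric against the same physician scores, and the metrics with the highest correlations would then be used to pick the model. An interval from `bootstrap_mean_diff` that contains zero is consistent with two sets of utility scores being comparable.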
