Standardizing Longitudinal Radiology Report Evaluation via Large Language Model Annotation

Longitudinal information in radiology reports is the record of how findings change across successive examinations, and it is crucial for monitoring disease progression and guiding clinical decisions. Many recent automated radiology report generation methods are designed to capture longitudinal information; however, validating their performance is challenging, because no existing tool consistently labels temporal changes in both ground-truth and model-generated texts so that the two can be compared meaningfully. Existing annotation methods are typically labor-intensive and rely on manually built lexicons and rules. Complex rule sets are often closed-source, domain-specific, and hard to adapt, whereas overly simple ones tend to miss essential specialized information. Large language models (LLMs) offer a promising alternative for annotation: they can capture nuanced linguistic patterns and semantic similarities without extensive manual intervention, and they adapt well to new contexts. In this study, we therefore propose an LLM-based pipeline that automatically annotates longitudinal information in radiology reports. The pipeline first identifies sentences containing longitudinal information and then extracts the progression of diseases. We evaluate and compare five mainstream LLMs on these two tasks using 500 manually annotated reports. Considering both efficiency and performance, we select Qwen2.5-32B and use it to annotate another 95,169 reports from the public MIMIC-CXR dataset. The resulting Qwen2.5-32B-annotated dataset provides a standardized benchmark for evaluating report generation models, which we use to assess seven state-of-the-art report generation models. Our LLM-based annotation method outperforms existing annotation solutions, achieving 11.3% and 5.3% higher F1-scores for longitudinal information detection and disease tracking, respectively.
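
To make the two-stage design concrete, the following Python sketch shows one way such a pipeline could be wired together. It is an illustration under stated assumptions, not the implementation used in this study: the prompts, the label set in `PROGRESSION_LABELS`, the function names, and the `keyword_stub` stand-in for a real LLM are all hypothetical.

```python
import re
from typing import Callable, Dict, List

# Hypothetical label set; the actual progression categories are an assumption here.
PROGRESSION_LABELS = ("improved", "worsened", "stable", "new", "resolved")


def split_sentences(report: str) -> List[str]:
    """Naive splitter on sentence-final punctuation (illustrative only)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", report) if s.strip()]


def annotate_report(report: str, llm: Callable[[str], str]) -> List[Dict[str, str]]:
    """Stage 1: keep sentences flagged as longitudinal. Stage 2: label the change."""
    annotations = []
    for sentence in split_sentences(report):
        # Stage 1: does the sentence carry longitudinal information?
        answer = llm(
            "Does the following radiology report sentence compare a finding "
            f"with a prior exam? Answer yes or no.\nSentence: {sentence}"
        )
        if not answer.strip().lower().startswith("yes"):
            continue
        # Stage 2: extract the direction of change for the flagged sentence.
        label = llm(
            "Classify the change described as one of: "
            f"{', '.join(PROGRESSION_LABELS)}.\nSentence: {sentence}"
        ).strip().lower()
        if label not in PROGRESSION_LABELS:
            label = "unknown"  # guard against free-form model output
        annotations.append({"sentence": sentence, "progression": label})
    return annotations


def keyword_stub(prompt: str) -> str:
    """Keyword-based stand-in for an LLM call, used only to exercise the sketch."""
    sentence = prompt.rsplit("Sentence:", 1)[1].lower()
    hits = [label for label in PROGRESSION_LABELS if label in sentence]
    if prompt.startswith("Does"):
        return "yes" if hits else "no"
    return hits[0] if hits else "unclear"
```

Passing the model as a plain callable keeps the sketch independent of any particular serving stack; in practice `keyword_stub` would be replaced by a function that sends the prompt to the chosen LLM and returns its text response.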
