Chest X-ray report generation and automated evaluation are limited by poor recognition of low-prevalence abnormalities and inadequate handling of clinically important language, including negation and ambiguity. We develop a clinician-guided framework combining human expertise and large language models for multi-label finding extraction from free-text chest X-ray reports and use it to define Ran Score, a finding-level metric for report evaluation. Using three non-overlapping MIMIC-CXR-EN cohorts from a public chest X-ray dataset and an independent ChestX-CN validation cohort, we optimize prompts, establish radiologist-derived reference labels and evaluate report generation models. The optimized framework improves the macro-averaged score from 0.753 to 0.956 on the MIMIC-CXR-EN development cohort, exceeds the CheXbert benchmark by 15.7 percentage points on directly comparable labels, and shows robust generalization on the ChestX-CN validation cohort. Here we show that clinician-guided prompt optimization improves agreement with a radiologist-derived reference standard and that Ran Score enables finding-level evaluation of report fidelity, particularly for low-prevalence abnormalities.
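To illustrate why macro-averaging matters for low-prevalence abnormalities, here is a minimal sketch of a macro-averaged, finding-level score over multi-label report annotations. The abstract does not say which per-finding score is averaged or how Ran Score is computed, so the use of F1, the function name `macro_f1`, the finding names, and the toy labels are all assumptions made for illustration only.

```python
from typing import Dict, List


def macro_f1(
    reference: List[Dict[str, int]],
    predicted: List[Dict[str, int]],
    findings: List[str],
) -> float:
    """Average per-finding F1 so every finding counts equally.

    Assumption: F1 is the per-finding score; the source only says
    "macro-averaged score". Labels are 1 (present) or 0 (absent).
    """
    scores = []
    for finding in findings:
        tp = fp = fn = 0
        for ref, pred in zip(reference, predicted):
            r = ref.get(finding, 0)
            p = pred.get(finding, 0)
            if r and p:
                tp += 1
            elif p:
                fp += 1
            elif r:
                fn += 1
        denom = 2 * tp + fp + fn
        # Convention chosen here: a finding absent from both label sets scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: three reports, two findings.
findings = ["pneumothorax", "effusion"]
reference = [
    {"pneumothorax": 1, "effusion": 1},
    {"pneumothorax": 0, "effusion": 1},
    {"pneumothorax": 0, "effusion": 0},
]
predicted = [
    {"pneumothorax": 0, "effusion": 1},
    {"pneumothorax": 0, "effusion": 1},
    {"pneumothorax": 0, "effusion": 1},
]

score = macro_f1(reference, predicted, findings)
print(f"macro F1 = {score:.3f}")
```

In this toy case the single missed pneumothorax drives that finding's F1 to 0, and because each finding carries equal weight, the macro average drops to 0.4 even though effusion is labeled well. A micro-averaged score would be dominated by the more frequent finding instead.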