In the generative AI era, where even critical medical tasks are increasingly automated, radiology report generation (RRG) continues to rely on suboptimal metrics for quality assessment. Developing domain-specific metrics has therefore been an active area of research, yet it remains challenging due to the lack of a unified, well-defined framework for assessing their robustness and applicability in clinical contexts. To address this, we present CTest-Metric, the first unified metric assessment framework, with three modules that determine the clinical feasibility of metrics for CT RRG. The modules test: (i) Writing Style Generalizability (WSG) via LLM-based rephrasing; (ii) Synthetic Error Injection (SEI) at graded severities; and (iii) Metrics-vs-Expert correlation (MvE) using clinician ratings on 175 "disagreement" cases. Eight widely used metrics (BLEU, ROUGE, METEOR, BERTScore-F1, F1-RadGraph, RaTEScore, GREEN Score, CRG) are studied across seven LLMs built on a CT-CLIP encoder. Using our framework, we find that lexical NLG metrics are highly sensitive to stylistic variations; GREEN Score aligns best with expert judgments (Spearman ρ ≈ 0.70), while CRG shows a negative correlation; and BERTScore-F1 is least sensitive to factual error injection. We will release the framework, the code, and the permissible portion of the anonymized evaluation data (rephrased and error-injected CT reports) to facilitate reproducible benchmarking and future metric development.