ReportQA: QA-Based Radiology Report Evaluation

Radiology report evaluation is essential for advancing automated report generation. Natural language generation metrics have limited clinical relevance. Clinical efficacy (CE) metrics evaluate important medical findings, but they focus mainly on the presence of findings and cover only a limited set of entities. Because they rely heavily on manual annotations, CE metrics are difficult to extend to new clinical entities or attributes. In clinical practice, radiology reports serve as a medium for information transfer: clinicians use them to perform downstream diagnostic tasks without directly inspecting the images. Based on this insight, we propose ReportQA, a clinically relevant and flexible radiology report evaluation framework that supports detailed quantitative analysis of radiology report generation systems. We first collect datasets covering multiple imaging modalities and anatomical regions. We then construct knowledge trees of clinical entities and attributes with radiologist guidance, and use large language models (LLMs) to extract structured information from raw reports. Next, we generate QA pairs from predefined templates and apply quality control through self-filtering and report-based filtering. During evaluation, the report under evaluation is treated as context, and an LLM acting as a judge model answers the QA pairs from that context. Based on the resulting QA accuracy, we introduce the QAScore metric. Compared with existing metrics, QAScore aligns better with radiologist judgments. Experiments on multiple state-of-the-art vision-language models reveal that current report-based inference paradigms struggle to learn fine-grained clinical representations and exhibit a strong negative prior bias. In contrast, question-driven inference provides a more effective alternative. For reproducibility and extensibility, we release the knowledge trees, structured reports, and QA pairs, along with the pipeline code for QA construction and evaluation.
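
The evaluation loop described above (template-based QA pairs, a judge that answers from the report alone, and a score derived from QA accuracy) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the released pipeline: the template wording, the `StructuredReport` layout, the function names `build_qa_pairs` and `qa_accuracy`, and the keyword-based `toy_judge` standing in for the LLM judge are all hypothetical, and the score is taken here to be plain accuracy, which the abstract does not specify in detail.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical structured report layout: entity -> attribute -> value.
StructuredReport = Dict[str, Dict[str, str]]
QAPair = Tuple[str, str]  # (question, reference answer)

# Hypothetical question templates, one per attribute.
TEMPLATES = {
    "presence": "Is {entity} present?",
    "location": "Where is the {entity} located?",
}


def build_qa_pairs(structured: StructuredReport) -> List[QAPair]:
    """Fill one template per (entity, attribute) found in the structured report."""
    pairs: List[QAPair] = []
    for entity, attrs in structured.items():
        for attr, value in attrs.items():
            template = TEMPLATES.get(attr)
            if template is not None:
                pairs.append((template.format(entity=entity), value))
    return pairs


def qa_accuracy(
    report: str,
    qa_pairs: List[QAPair],
    judge: Callable[[str, str], str],
) -> float:
    """Fraction of questions the judge answers correctly using only the report."""
    if not qa_pairs:
        return 0.0
    correct = sum(
        judge(report, question).strip().lower() == answer.strip().lower()
        for question, answer in qa_pairs
    )
    return correct / len(qa_pairs)


def toy_judge(report: str, question: str) -> str:
    """Keyword stand-in for the LLM judge (assumption, for illustration only)."""
    text = report.lower()
    if question.startswith("Is "):
        entity = question[len("Is "):-len(" present?")].lower()
        negated = f"no {entity}" in text
        return "no" if negated or entity not in text else "yes"
    for side in ("left", "right"):
        if side in text:
            return side
    return "unknown"


# QA pairs come from the reference report's structured form.
reference = {
    "pleural effusion": {"presence": "yes", "location": "left"},
    "pneumothorax": {"presence": "no"},
}
pairs = build_qa_pairs(reference)

# Generated reports are scored by how many questions they allow the judge to answer.
good_score = qa_accuracy("Small left pleural effusion. No pneumothorax.", pairs, toy_judge)
poor_score = qa_accuracy("No pleural effusion. No pneumothorax.", pairs, toy_judge)
print(good_score, poor_score)
```

In practice the `judge` callable would wrap an LLM call, and exact string matching would likely be replaced by whatever answer-matching rule the framework defines; both are left open here because the abstract does not describe them.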
