RSNA Large Language Model Benchmark Dataset for Chest Radiographs of Cardiothoracic Disease: Radiologist Evaluation and Validation Enhanced by AI Labels (REVEAL-CXR)


Yishu Wei, Adam E. Flanders, Errol Colak, John Mongan, Luciano M Prevedello, Po-Hao Chen, Henrique Min Ho Lee, Gilberto Szarf, Hamilton Shoji, Jason Sho, Katherine Andriole, Tessa Cook, Lisa C. Adams, Linda C. Chu, Maggie Chung, Geraldine Brusca-Augello, Djeven P. Deva, Navneet Singh, Felipe Sanchez Tijmes, Jeffrey B. Alpert, Elsie T. Nguyen, Drew A. Torigian, Kate Hanneman, Lauren K Groner, Alexander Phan, Ali Islam, Matias F. Callejas, Gustavo Borges da Silva Teles, Faisal Jamal, Maryam Vazirabad, Ali Tejani, Hari Trivedi, Paulo Kuriki, Rajesh Bhayana, Elana T. Benishay, Yi Lin, Yifan Peng, George Shih

Multimodal large language models (LLMs) have demonstrated performance comparable to that of radiology trainees on multiple-choice board-style exams. However, developing clinically useful multimodal LLM tools requires high-quality benchmarks curated by domain experts. The purpose of this work was to curate released and holdout datasets of 100 chest radiographic studies each and to propose an artificial intelligence (AI)-assisted expert labeling procedure that allows radiologists to label studies more efficiently. A total of 13,735 deidentified chest radiographs and their corresponding reports from the Medical Imaging and Data Resource Center (MIDRC) were used. GPT-4o extracted abnormal findings from the reports, which were then mapped to 12 benchmark labels with a locally hosted LLM (Phi-4-Reasoning). From these studies, 1,000 were sampled on the basis of the AI-suggested benchmark labels for expert review; the sampling algorithm ensured that the selected studies were clinically relevant and captured a range of difficulty levels. Seventeen chest radiologists participated, marking each study "Agree all", "Agree mostly", or "Disagree" to indicate their assessment of the correctness of the LLM-suggested labels. Each chest radiograph was evaluated by three experts. For 381 radiographs, at least two radiologists selected "Agree all". From this set, 200 were selected, prioritizing those with less common or multiple finding labels, and divided into 100 released radiographs and 100 reserved as the holdout dataset. The holdout dataset is used exclusively by RSNA to independently evaluate different models. A benchmark of 200 chest radiographic studies with 12 benchmark labels was created and made publicly available at https://imaging.rsna.org, with each chest radiograph verified by three radiologists. In addition, an AI-assisted labeling procedure was developed to help radiologists label at scale, minimize unnecessary omissions, and support a semicollaborative environment.

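The selection step described in the abstract (keep studies where at least two of three readers chose "Agree all", prioritize less common or multiple finding labels, then split into released and holdout halves) can be illustrated with a minimal sketch. The abstract does not specify the scoring or splitting rule, so everything below is an assumption made for illustration: the inverse-frequency priority score, the alternating split, the function names, and the placeholder label names are hypothetical and are not the authors' implementation.

```python
from collections import Counter

AGREE_ALL = "Agree all"  # response string as quoted in the abstract


def consensus_studies(votes, min_agree=2):
    """Keep study IDs where at least `min_agree` readers chose "Agree all".

    `votes` maps a study ID to the list of reader responses for that study.
    """
    return [
        study_id
        for study_id, responses in votes.items()
        if sum(r == AGREE_ALL for r in responses) >= min_agree
    ]


def prioritize(study_labels, candidates, k):
    """Rank candidates so rare or multiple labels come first, return top k.

    ASSUMPTION: the score is the sum of inverse label frequencies among
    the candidates. A study gains score for each label it carries, and
    more for labels that few other candidates carry. The source does not
    state how prioritization was actually computed.
    """
    freq = Counter(
        label for study_id in candidates for label in study_labels[study_id]
    )

    def score(study_id):
        return sum(1.0 / freq[label] for label in study_labels[study_id])

    # Sort by descending score; break ties by study ID for determinism.
    ranked = sorted(candidates, key=lambda s: (-score(s), s))
    return ranked[:k]


def split_released_holdout(selected):
    """ASSUMPTION: alternate ranked studies between the two sets."""
    return selected[0::2], selected[1::2]


if __name__ == "__main__":
    # Hypothetical toy data; label names are placeholders, not the
    # benchmark's 12 labels.
    votes = {
        "s1": ["Agree all", "Agree all", "Disagree"],
        "s2": ["Agree all", "Agree mostly", "Disagree"],
        "s3": ["Agree all", "Agree all", "Agree all"],
        "s4": ["Agree all", "Agree all", "Agree mostly"],
    }
    study_labels = {
        "s1": ["finding_A"],
        "s2": ["finding_C"],
        "s3": ["finding_A", "finding_B"],
        "s4": ["finding_A"],
    }
    kept = consensus_studies(votes)
    top = prioritize(study_labels, kept, k=2)
    released, holdout = split_released_holdout(top)
    print(kept, top, released, holdout)
```

In the study itself the same kind of filter would run over the 1,000 reviewed radiographs, yield the 381 consensus cases, and select 200 of them. A sum of inverse frequencies is only one convenient way to reward both rarity and multiplicity in a single score.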