With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LLM-as-a-judge" paradigm offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. We propose MedVAL, a novel, self-supervised, data-efficient distillation method that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with their inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset of 840 physician-annotated outputs spanning 6 diverse medical tasks that capture real-world challenges. Across 10 state-of-the-art LMs, both open-source and proprietary, MedVAL distillation significantly improves (p < 0.001) alignment with physicians on seen and unseen tasks, raising average F1 scores from 66% to 83%. Despite strong baseline performance, MedVAL improves the best-performing proprietary LM (GPT-4o) by 8% without training on any physician-labeled data, and achieves performance statistically non-inferior (p < 0.001) to that of a single human expert on a subset annotated by multiple physicians. To support a scalable, risk-aware pathway toward clinical integration, we open-source: 1) our codebase (https://github.com/StanfordMIMI/MedVAL), 2) MedVAL-Bench (https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench), and 3) MedVAL-4B (https://huggingface.co/stanfordmimi/MedVAL-4B). Our benchmark provides evidence that LMs are approaching expert-level ability in validating AI-generated medical text.