With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LLM-as-a-judge" paradigm offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. We propose MedVAL, a novel, self-supervised, data-efficient distillation method that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with their inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset of 840 physician-annotated outputs spanning 6 diverse medical tasks that capture real-world challenges. Across 10 state-of-the-art LMs, both open-source and proprietary, MedVAL distillation significantly improves (p < 0.001) alignment with physicians on seen and unseen tasks, raising average F1 scores from 66% to 83%. Despite strong baseline performance, MedVAL improves the best-performing proprietary LM (GPT-4o) by 8% without training on any physician-labeled data, and achieves performance statistically non-inferior (p < 0.001) to that of a single human expert on a subset annotated by multiple physicians. To support a scalable, risk-aware pathway toward clinical integration, we open-source: 1) our codebase (https://github.com/StanfordMIMI/MedVAL), 2) MedVAL-Bench (https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench), and 3) MedVAL-4B (https://huggingface.co/stanfordmimi/MedVAL-4B). Our benchmark provides evidence that LMs are approaching expert-level ability in validating AI-generated medical text.