LLM-as-a-Judge in Healthcare: A Scoping Analysis of Applications, Methods, and Human Alignment

Large language models (LLMs) are increasingly deployed across healthcare applications, including clinical documentation, diagnostic reasoning, medication recommendation, and medical education. Their outputs are largely unstructured clinical text, which is difficult to evaluate reliably at scale. LLM-as-a-Judge, in which an LLM evaluates another system's output against task-specific criteria, offers a scalable alternative to manual expert review and is increasingly used in clinical evaluation, yet its validity in healthcare remains underexamined. Existing reviews focus on general-purpose LLM evaluation or on risk frameworks, rather than systematically characterizing how LLM-as-a-Judge is applied in healthcare and how well its judgments align with those of human experts. We therefore conduct a PRISMA-guided comprehensive review of LLM-as-a-Judge applications in healthcare, searching five databases for studies published between January 2023 and February 2026. Of 541 records screened, 134 studies met the eligibility criteria and were coded by health scenario, judge configuration, technical approach, and validation design. LLM-as-a-Judge is concentrated in clinical decision support, clinical natural language processing (NLP), medical knowledge and question answering (QA), and medical communication. OpenAI models are the most frequently used judges, and prompt engineering appears in nearly all studies, with ensemble, multi-agent, and retrieval-augmented designs as common extensions. Among studies reporting human validation, LLM judges often show moderate to strong alignment with expert judgments, although reliability varies substantially across tasks. Overall, this review positions LLM-as-a-Judge as a promising framework for scalable healthcare AI evaluation, while emphasizing that its clinical value depends on model design and rigorous validation.
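As an illustration of what "alignment with expert judgments" can mean in practice, the sketch below computes Cohen's kappa between an LLM judge's labels and an expert's labels on the same outputs. The abstract does not say which agreement statistics the reviewed studies used, so the choice of Cohen's kappa is an assumption, and the `cohen_kappa` helper and the eight pass/fail ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(judge_labels, expert_labels):
    """Cohen's kappa for two raters labelling the same items (nominal labels)."""
    if len(judge_labels) != len(expert_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge_labels)
    # Observed agreement: fraction of items on which both raters agree.
    p_o = sum(j == e for j, e in zip(judge_labels, expert_labels)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    judge_counts = Counter(judge_labels)
    expert_counts = Counter(expert_labels)
    p_e = sum(judge_counts[c] * expert_counts[c] for c in judge_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings for eight model outputs (not data from the review).
judge = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
expert = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
kappa = cohen_kappa(judge, expert)
print(f"Cohen's kappa: {kappa:.3f}")
```

Kappa corrects raw agreement for agreement expected by chance, which matters when one label dominates. For ordinal rubric scores, a weighted kappa or a rank correlation may be a better fit.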

