Large language models (LLMs) have emerged as powerful tools for analyzing complex datasets. Recent studies demonstrate their potential to generate useful, personalized responses when provided with patient-specific health information that encompasses lifestyle, biomarkers, and context. As LLM-driven health applications are increasingly adopted, rigorous and efficient evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization, and safety. Current evaluation practices for open-ended text responses rely heavily on human experts. This approach introduces human subjectivity, is often cost-prohibitive and labor-intensive, and hinders scalability, especially in complex domains like healthcare, where assessing a response requires domain expertise and consideration of multifaceted patient data. In this work, we introduce Adaptive Precise Boolean rubrics: an evaluation framework that streamlines human and automated evaluation of open-ended questions by identifying gaps in model responses using a minimal set of targeted rubric questions. Our approach builds on recent work in more general evaluation settings that contrasts a small set of complex evaluation targets with a larger set of precise, granular targets answerable with simple boolean responses. We validate this approach in metabolic health, a domain encompassing diabetes, cardiovascular disease, and obesity. Our results demonstrate that Adaptive Precise Boolean rubrics yield higher inter-rater agreement among both expert and non-expert human evaluators, and in automated assessments, than traditional Likert scales, while requiring approximately half the evaluation time of Likert-based methods. This enhanced efficiency, particularly in automated evaluation and non-expert contributions, paves the way for more extensive and cost-effective evaluation of LLMs in health.
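To make the adaptive idea concrete, the following is a minimal sketch (not the authors' implementation) of how an adaptive boolean rubric evaluation could be structured: a small set of high-level boolean screening questions is asked first, and granular follow-up questions are asked only when a screen fails. The rubric wording and the `judge` callable are hypothetical placeholders; in practice the judge would be a human rater or an LLM-based autorater.

```python
# Sketch of an adaptive precise boolean rubric evaluator.
# Assumptions: rubric text and the `judge` callable are illustrative only.
from typing import Callable, Dict, List

# Each high-level screening question maps to granular boolean follow-ups.
RUBRIC: Dict[str, List[str]] = {
    "Is the response factually accurate?": [
        "Are all biomarker values interpreted correctly?",
        "Are stated reference ranges correct?",
    ],
    "Is the response personalized to the patient?": [
        "Does it reference the patient's lifestyle data?",
        "Does it account for the patient's stated context?",
    ],
    "Is the response safe?": [
        "Does it avoid contraindicated advice?",
        "Does it recommend clinician follow-up where appropriate?",
    ],
}

def evaluate(response: str, judge: Callable[[str, str], bool]) -> Dict[str, bool]:
    """Return boolean verdicts per rubric question, asking granular
    follow-ups only when a high-level screen fails (the adaptive step,
    which keeps the number of questions per response minimal)."""
    verdicts: Dict[str, bool] = {}
    for screen, followups in RUBRIC.items():
        ok = judge(response, screen)
        verdicts[screen] = ok
        if not ok:  # drill down only on failure to localize the gap
            for question in followups:
                verdicts[question] = judge(response, question)
    return verdicts
```

Because every verdict is a simple yes/no, agreement between raters can be scored directly per question, which is one plausible reason boolean rubrics yield higher inter-rater agreement than multi-point Likert scales.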