Large language models (LLMs) have emerged as powerful tools for analyzing complex datasets. Recent studies demonstrate their potential to generate useful, personalized responses when provided with patient-specific health information that encompasses lifestyle, biomarkers, and context. As LLM-driven health applications are increasingly adopted, rigorous and efficient evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization, and safety. Current evaluation practices for open-ended text responses rely heavily on human experts. This approach introduces human subjectivity, is often cost-prohibitive and labor-intensive, and hinders scalability, especially in complex domains like healthcare, where assessing a response requires domain expertise and consideration of multifaceted patient data. In this work, we introduce Adaptive Precise Boolean rubrics: an evaluation framework that streamlines human and automated evaluation of open-ended questions by identifying gaps in model responses using a minimal set of targeted rubric questions. Our approach builds on recent work in more general evaluation settings that contrasts a small set of complex evaluation targets with a larger set of precise, granular targets answerable with simple boolean responses. We validate this approach in metabolic health, a domain encompassing diabetes, cardiovascular disease, and obesity. Our results demonstrate that Adaptive Precise Boolean rubrics yield higher inter-rater agreement among both expert and non-expert human evaluators, and in automated assessments, than traditional Likert scales, while requiring approximately half the evaluation time of Likert-based methods. This enhanced efficiency, particularly in automated evaluation and non-expert contributions, paves the way for more extensive and cost-effective evaluation of LLMs in health.
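To make the adaptive idea concrete, the following is a minimal sketch (not the authors' implementation) of how an adaptive boolean rubric evaluation could be structured: a small set of high-level boolean screening questions is asked first, and granular follow-up questions are asked only when a screen fails. The rubric wording and the `judge` callable are hypothetical placeholders; in practice the judge would be a human rater or an LLM-based autorater.

```python
# Sketch of an adaptive precise boolean rubric evaluator.
# Assumptions: rubric text and the `judge` callable are illustrative only.
from typing import Callable, Dict, List

# Each high-level screening question maps to granular boolean follow-ups.
RUBRIC: Dict[str, List[str]] = {
    "Is the response factually accurate?": [
        "Are all biomarker values interpreted correctly?",
        "Are stated reference ranges correct?",
    ],
    "Is the response personalized to the patient?": [
        "Does it reference the patient's lifestyle data?",
        "Does it account for the patient's stated context?",
    ],
    "Is the response safe?": [
        "Does it avoid contraindicated advice?",
        "Does it recommend clinician follow-up where appropriate?",
    ],
}

def evaluate(response: str, judge: Callable[[str, str], bool]) -> Dict[str, bool]:
    """Return boolean verdicts per rubric question, asking granular
    follow-ups only when a high-level screen fails (the adaptive step,
    which keeps the number of questions per response minimal)."""
    verdicts: Dict[str, bool] = {}
    for screen, followups in RUBRIC.items():
        ok = judge(response, screen)
        verdicts[screen] = ok
        if not ok:  # drill down only on failure to localize the gap
            for question in followups:
                verdicts[question] = judge(response, question)
    return verdicts
```

Because every verdict is a simple yes/no, agreement between raters can be scored directly per question, which is one plausible reason boolean rubrics yield higher inter-rater agreement than multi-point Likert scales.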