Large Language Models (LLMs) are expected to significantly contribute to patient care, diagnostics, and administrative processes. Emerging biomedical LLMs aim to address healthcare-specific challenges, including privacy demands and computational constraints. Assessing the models' suitability for this sensitive application area is of the utmost importance. However, evaluation has primarily been limited to non-clinical tasks, which do not reflect the complexity of practical clinical applications. To fill this gap, we present the Clinical Language Understanding Evaluation (CLUE), a benchmark tailored to evaluate LLMs on clinical tasks. CLUE includes six tasks to test the practical applicability of LLMs in complex healthcare settings. Our evaluation includes a total of $25$ LLMs. In contrast to previous evaluations, CLUE shows a decrease in performance for nine out of twelve biomedical models. Our benchmark represents a step towards a standardized approach to evaluating and developing LLMs in healthcare to align future model development with the real-world needs of clinical application. We open-source all evaluation scripts and datasets for future research at https://github.com/TIO-IKIM/CLUE.