Background: Healthcare has increasingly turned its attention to Large Language Models (LLMs) because of their remarkable performance. However, their performance in actual clinical applications remains underexplored. Traditional evaluations based on question-answering tasks do not fully capture the nuanced contexts of clinical practice. This gap highlights the need for more in-depth, practical assessments of LLMs in real-world healthcare settings. Objective: We sought to evaluate the performance of LLMs in the complex clinical context of adult critical care medicine using systematic and comprehensible analytic methods, including clinician annotation and adjudication. Methods: We investigated the performance of three general-purpose LLMs in understanding and processing real-world clinical notes. Concepts in 150 clinical notes were identified by MetaMap and then labeled by 9 clinicians. Each LLM's proficiency was evaluated by having it identify the temporality and negation of these concepts under different prompts, allowing an in-depth analysis. Results: GPT-4 showed superior overall performance compared with the other LLMs. GPT-3.5 and text-davinci-003, in contrast, showed improved performance when appropriate prompting strategies were employed. The GPT family models also demonstrated considerable efficiency, as evidenced by their cost-effectiveness and time savings. Conclusion: We developed and operationalized a comprehensive qualitative performance evaluation framework for LLMs that goes beyond singular aspects of performance. Supported by expert annotations, this methodology not only validates LLMs' capabilities in processing complex medical data but also establishes a benchmark for future LLM evaluations across specialized domains.