Background: Evidence-based medicine (EBM) is fundamental to modern clinical practice, requiring clinicians to continually update their knowledge and apply the best available clinical evidence in patient care. Rapid advances in medical research make this difficult, leaving clinicians with information overload. Artificial intelligence (AI), specifically generative Large Language Models (LLMs), offers a promising way to manage this complexity.

Methods: Real-world clinical cases were curated across various specialties and converted into .json files for analysis. The LLMs evaluated included proprietary models (ChatGPT 3.5 and 4, Gemini Pro) and open-source models (LLaMA v2 and Mixtral-8x7B). Each model was equipped with tools to retrieve information from the case files and make clinical decisions, mirroring how clinicians operate in the real world. Model performance was evaluated on correctness of the final answer, judicious use of tools, conformity to guidelines, and resistance to hallucinations.

Results: GPT-4 was the most capable of autonomous operation in a clinical setting, and was generally more effective at ordering relevant investigations and conforming to clinical guidelines. Models showed limitations in handling complex guidelines and diagnostic nuances. Retrieval-augmented generation made recommendations more tailored to patients and healthcare systems.

Conclusions: LLMs can be made to function as autonomous practitioners of evidence-based medicine. Their ability to use tools can be harnessed to interact with the infrastructure of a real-world healthcare system and to perform patient management tasks in a guideline-directed manner. Prompt engineering may further enhance this potential and transform healthcare for both clinician and patient.
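The Methods describe models that retrieve information from .json case files through tools, and an evaluation that considers how judiciously those tools are used. The following is a minimal sketch of that idea, not the study's implementation. The case schema, the case contents, the tool names (`get_history`, `order_investigation`), and the request format are all hypothetical, and no model is actually called: a hard-coded list stands in for the requests an LLM might emit.

```python
import json

# Hypothetical case file with invented contents; the abstract does not
# describe the study's actual .json schema.
CASE_JSON = """
{
  "presenting_complaint": "Chest pain for 2 hours",
  "history": "55-year-old smoker with hypertension",
  "investigations": {
    "ecg": "ST elevation in leads II, III, aVF",
    "troponin": "elevated"
  }
}
"""


class CaseTools:
    """Exposes a case file through retrieval tools and logs each call."""

    def __init__(self, case):
        self.case = case
        self.call_log = []  # ordered record of tool calls, for later review

    def get_history(self):
        self.call_log.append("get_history")
        return self.case["history"]

    def order_investigation(self, name):
        self.call_log.append(f"order_investigation:{name}")
        # Report missing results explicitly rather than inventing them.
        return self.case["investigations"].get(name, "result not available")


def dispatch(tools, request):
    """Run one tool request of the assumed form {"tool": ..., "args": {...}}."""
    handlers = {
        "get_history": tools.get_history,
        "order_investigation": tools.order_investigation,
    }
    handler = handlers.get(request["tool"])
    if handler is None:
        return f"unknown tool: {request['tool']}"
    return handler(**request.get("args", {}))


tools = CaseTools(json.loads(CASE_JSON))

# Stand-in for the requests a model might emit; no LLM is called here.
requests = [
    {"tool": "get_history"},
    {"tool": "order_investigation", "args": {"name": "ecg"}},
    {"tool": "order_investigation", "args": {"name": "d_dimer"}},
]
observations = [dispatch(tools, r) for r in requests]
```

Keeping the case data behind tool calls means the model sees only what it asks for, and `call_log` gives a reviewer a trace of which investigations were requested and in what order. Returning an explicit "result not available" for a missing investigation is one way to avoid handing the model a gap it might fill with a hallucinated value.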