Evidence to guide healthcare decisions is often limited by a lack of relevant and trustworthy literature as well as by the difficulty of contextualizing existing research for a specific patient. Large language models (LLMs) could potentially address both challenges, either by summarizing published literature or by generating new studies based on real-world data (RWD). We evaluated the ability of five LLM-based systems to answer 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability. In their current form, general-purpose LLMs (ChatGPT-4, Claude 3 Opus, Gemini Pro 1.5) rarely produced answers that were judged relevant and evidence-based (2%-10%). In contrast, retrieval-augmented generation (RAG)-based and agentic LLM systems produced relevant and evidence-based answers for 24% (OpenEvidence) to 58% (ChatRWD) of questions. Only the agentic ChatRWD was able to answer novel questions (65% vs. 0-9% for the other LLMs). These results suggest that, while general-purpose LLMs should not be used as-is, a purpose-built RAG system for evidence summarization working synergistically with a system for generating novel evidence would improve the availability of pertinent evidence for patient care.