Evaluation of GPT-3.5 and GPT-4 for supporting real-world information needs in healthcare delivery

Debadutta Dash, Rahul Thapa, Juan M. Banda, Akshay Swaminathan, Morgan Cheatham, Mehr Kashyap, Nikesh Kotecha, Jonathan H. Chen, Saurabh Gombar, Lance Downing, Rachel Pedreira, Ethan Goh, Angel Arnaout, Garret Kenn Morris, Honor Magon, Matthew P Lungren, Eric Horvitz, Nigam H. Shah

From arXiv; 27 pages including supplemental information

Despite growing interest in using large language models (LLMs) in healthcare, current explorations do not assess the real-world utility and safety of LLMs in clinical settings. Our objective was to determine whether two LLMs can serve information needs submitted by physicians as questions to an informatics consultation service in a safe and concordant manner. Sixty-six questions from an informatics consult service were submitted to GPT-3.5 and GPT-4 via simple prompts. Twelve physicians assessed each LLM response for the possibility of patient harm and for concordance with existing reports from the informatics consultation service. Physician assessments were summarized by majority vote. For no question did a majority of physicians deem either LLM's response harmful. For GPT-3.5, responses to 8 questions were concordant with the informatics consult report, 20 were discordant, and 9 could not be assessed; 29 responses had no majority among "Agree", "Disagree", and "Unable to assess". For GPT-4, responses to 13 questions were concordant, 15 were discordant, and 3 could not be assessed; 35 responses had no majority. Responses from both LLMs were largely devoid of overt harm, but fewer than 20% of the responses agreed with an answer from the informatics consultation service, responses contained hallucinated references, and physicians were divided on what constitutes harm. These results suggest that while general-purpose LLMs are able to provide safe and credible responses, they often do not meet the specific information need of a given question. A definitive evaluation of the usefulness of LLMs in healthcare settings will likely require additional research on prompt engineering, calibration, and custom-tailoring of general-purpose models.
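
The majority-vote summary described in the abstract can be illustrated with a minimal sketch. It assumes that "majority" means more than half of the raters choosing the same label; the abstract does not state the exact rule, so that threshold is an assumption. The function names and the example votes are hypothetical and are not data from the study. The last two lines only recompute the concordance rates from the counts reported above.

```python
from collections import Counter

def majority_label(votes):
    """Return the label chosen by more than half of the raters, else None.

    Assumption: a strict majority (> 50% of votes) is required.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None

def summarize(votes_per_question):
    """Count questions per majority label; the key None means no majority."""
    return Counter(majority_label(v) for v in votes_per_question)

# Hypothetical votes from 12 raters on three questions (illustrative only).
example = [
    ["Agree"] * 7 + ["Disagree"] * 5,                              # majority: Agree
    ["Agree"] * 6 + ["Disagree"] * 6,                              # tie: no majority
    ["Disagree"] * 5 + ["Agree"] * 4 + ["Unable to assess"] * 3,   # plurality only
]
summary = summarize(example)

# Concordance rates implied by the counts reported in the abstract.
gpt35_rate = 8 / 66   # about 12.1%
gpt4_rate = 13 / 66   # about 19.7%
```

Under this rule a plurality is not enough, which is one way a question can land in the "no majority" group even when one label has the most votes.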
