Medical-domain dialog agents are appealing because they could give patients faster access to information while freeing medical specialists to concentrate on critical tasks. However, integrating large language models (LLMs) into these agents introduces limitations that may have serious consequences. This paper investigates the challenges and risks of using GPT-3-based models for medical question answering (MedQA). We perform several evaluations contextualized in terms of standard medical principles, and we provide a procedure for manually designing patient queries that stress-test the high-risk limitations of LLMs in MedQA systems. Our analysis reveals that LLMs fail to respond adequately to these queries, generating erroneous medical information, unsafe recommendations, and content that may be considered offensive.