Medical-domain dialog agents are appealing because they could give patients faster access to information while allowing medical specialists to concentrate on critical tasks. However, integrating large language models (LLMs) into these agents introduces limitations that may have serious consequences. This paper investigates the challenges and risks of using GPT-3-based models for medical question answering (MedQA). We perform several evaluations contextualized in terms of standard medical principles, and we provide a procedure for manually designing patient queries to stress-test high-risk limitations of LLMs in MedQA systems. Our analysis reveals that LLMs fail to respond adequately to these queries, generating erroneous medical information, unsafe recommendations, and content that may be considered offensive.