Reinforcement learning-based large language models, such as ChatGPT, are believed to have the potential to aid human experts in many domains, including healthcare. There is, however, little work on ChatGPT's ability to perform a key task in healthcare: formal, probabilistic medical diagnostic reasoning. This type of reasoning is used, for example, to update a pre-test probability to a post-test probability. In this work, we probe ChatGPT's ability to perform this task. In particular, we ask ChatGPT to give examples of how to use Bayes' rule for medical diagnosis. Our prompts range from queries that use terminology from pure probability (e.g., requests for a posterior of A given B and C) to queries that use terminology from medical diagnosis (e.g., requests for a posterior probability of Covid given a test result and cough). We show that introducing medical variable names increases the number of errors that ChatGPT makes. Given our results, we also show how prompt engineering can help ChatGPT partially avoid these errors. We discuss our results in light of recent commentaries on sensitivity and specificity. We also discuss how our results might inform new research directions for large language models.
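For readers unfamiliar with the update from a pre-test to a post-test probability, the following is a minimal sketch of Bayes' rule in the two settings the prompts cover: a single test result, and two findings (B and C, or a test result and cough). The function names and all numeric values are hypothetical and are not taken from the study. The two-finding version additionally assumes that the findings are conditionally independent given disease status, which is a simplifying assumption made here for illustration only.

```python
def post_test_probability(pre_test, sensitivity, specificity, positive=True):
    """Bayes' rule for a binary condition D and a binary test T.

    sensitivity = P(T+ | D), specificity = P(T- | not D).
    """
    if positive:
        like_d = sensitivity            # P(T+ | D)
        like_not_d = 1.0 - specificity  # P(T+ | not D), the false-positive rate
    else:
        like_d = 1.0 - sensitivity      # P(T- | D), the false-negative rate
        like_not_d = specificity        # P(T- | not D)
    numerator = like_d * pre_test
    evidence = numerator + like_not_d * (1.0 - pre_test)
    return numerator / evidence


def posterior_two_findings(prior, p_b_given_a, p_b_given_not_a,
                           p_c_given_a, p_c_given_not_a):
    """P(A | B, C), ASSUMING B and C are conditionally independent
    given A and given not-A (an illustrative assumption)."""
    numerator = p_b_given_a * p_c_given_a * prior
    evidence = numerator + p_b_given_not_a * p_c_given_not_a * (1.0 - prior)
    return numerator / evidence


# Hypothetical numbers: 10% pre-test probability, 90% sensitivity,
# 95% specificity, positive test result.
print(round(post_test_probability(0.10, 0.90, 0.95), 3))  # 0.667

# Same test plus a cough with hypothetical P(cough | D) = 0.60
# and P(cough | not D) = 0.20.
print(round(posterior_two_findings(0.10, 0.90, 0.05, 0.60, 0.20), 3))  # 0.857
```

Note that the false-positive rate entering the denominator is one minus the specificity, not the specificity itself.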