Reinforcement learning-based large language models, such as ChatGPT, are believed to have the potential to aid human experts in many domains, including healthcare. There is, however, little work on ChatGPT's ability to perform a key task in healthcare: formal, probabilistic medical diagnostic reasoning. This type of reasoning is used, for example, to update a pre-test probability to a post-test probability. In this work, we probe ChatGPT's ability to perform this task. In particular, we ask ChatGPT to give examples of how to use Bayes' rule for medical diagnosis. Our prompts range from queries that use terminology from pure probability (e.g., requests for a posterior of A given B and C) to queries that use terminology from medical diagnosis (e.g., requests for a posterior probability of Covid given a test result and a cough). We show that introducing medical variable names increases the number of errors that ChatGPT makes. Given our results, we also show how prompt engineering can help ChatGPT partially avoid these errors. We discuss our results in light of recent commentaries on sensitivity and specificity. We also discuss how our results might inform new research directions for large language models.
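To make the task concrete, the following is a minimal Python sketch of the kind of calculation the prompts ask for: updating a pre-test probability to a post-test probability with Bayes' rule, first for a single test result and then for two findings (a test result and a cough). The function names and all numerical values (prevalence, sensitivity, specificity, cough rates) are hypothetical and chosen only for illustration; they are not taken from our experiments. The two-finding case additionally assumes that the findings are conditionally independent given disease status, which is a simplifying assumption rather than a clinical fact.

```python
def post_test_probability(pre_test, sensitivity, specificity, positive=True):
    """Bayes' rule for one binary test.

    pre_test:    P(disease) before the test
    sensitivity: P(test positive | disease)
    specificity: P(test negative | no disease)
    """
    if positive:
        p_result_given_disease = sensitivity
        p_result_given_healthy = 1.0 - specificity
    else:
        p_result_given_disease = 1.0 - sensitivity
        p_result_given_healthy = specificity
    numerator = p_result_given_disease * pre_test
    denominator = numerator + p_result_given_healthy * (1.0 - pre_test)
    return numerator / denominator


def posterior_two_findings(prior, p_b_given_a, p_b_given_not_a,
                           p_c_given_a, p_c_given_not_a):
    """P(A | B, C), ASSUMING B and C are conditionally independent given A
    and given not-A (an illustrative simplification)."""
    numerator = p_b_given_a * p_c_given_a * prior
    denominator = numerator + p_b_given_not_a * p_c_given_not_a * (1.0 - prior)
    return numerator / denominator


# Hypothetical numbers, for illustration only.
prior = 0.10          # assumed pre-test probability of Covid
sensitivity = 0.90    # assumed P(positive test | Covid)
specificity = 0.95    # assumed P(negative test | no Covid)
p_cough_covid = 0.70  # assumed P(cough | Covid)
p_cough_other = 0.20  # assumed P(cough | no Covid)

after_test = post_test_probability(prior, sensitivity, specificity)
after_both = posterior_two_findings(
    prior,
    sensitivity, 1.0 - specificity,   # positive test: B
    p_cough_covid, p_cough_other,     # cough: C
)

print(round(after_test, 3))  # 0.667
print(round(after_both, 3))  # 0.875
```

Under the conditional-independence assumption, updating on the test first and then treating that result as the new pre-test probability for the cough gives the same posterior as the joint calculation. This is one way to check an answer of this kind.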