Reinforcement learning-based large language models, such as ChatGPT, are believed to have the potential to aid human experts in many domains, including healthcare. There is, however, little work on ChatGPT's ability to perform a key task in healthcare: formal, probabilistic medical diagnostic reasoning. This type of reasoning is used, for example, to update a pre-test probability to a post-test probability. In this work, we probe ChatGPT's ability to perform this task. In particular, we ask ChatGPT to give examples of how to use Bayes' rule for medical diagnosis. Our prompts range from queries that use terminology from pure probability (e.g., requests for a "posterior probability") to queries that use terminology from the medical diagnosis literature (e.g., requests for a "post-test probability"). We show that introducing medical variable names increases the number of errors that ChatGPT makes. Building on these results, we also show how prompt engineering can help ChatGPT partially avoid these errors. We discuss our results in light of recent commentaries on sensitivity and specificity. We also discuss how our results might inform new research directions for large language models.
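To make the pre-test to post-test update mentioned above concrete, the following is a minimal Python sketch of Bayes' rule for a binary diagnostic test. The function name `post_test_probability`, its parameters, and the numbers in the usage lines are illustrative assumptions; they are not taken from the prompts or experiments described in this work.

```python
def post_test_probability(pre_test, sensitivity, specificity, positive=True):
    """Update a pre-test probability of disease with Bayes' rule.

    Illustrative sketch only: the name and signature are hypothetical.
    pre_test    = P(disease) before the test
    sensitivity = P(test positive | disease)
    specificity = P(test negative | no disease)
    """
    for name, value in (("pre_test", pre_test),
                        ("sensitivity", sensitivity),
                        ("specificity", specificity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")

    if positive:
        # P(positive and disease) and P(positive and no disease)
        with_disease = sensitivity * pre_test
        without_disease = (1.0 - specificity) * (1.0 - pre_test)
    else:
        # P(negative and disease) and P(negative and no disease)
        with_disease = (1.0 - sensitivity) * pre_test
        without_disease = specificity * (1.0 - pre_test)

    evidence = with_disease + without_disease  # P(observed test result)
    if evidence == 0.0:
        raise ValueError("the observed test result has probability zero")
    return with_disease / evidence


# Made-up inputs: 10% pre-test probability, 90% sensitivity, 80% specificity.
print(f"{post_test_probability(0.10, 0.90, 0.80, positive=True):.3f}")   # 0.333
print(f"{post_test_probability(0.10, 0.90, 0.80, positive=False):.3f}")  # 0.014
```

In this sketch the "posterior probability" of pure probability and the "post-test probability" of the diagnosis literature are the same quantity; only the variable names differ, which is the contrast the prompts in this work vary.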