Language models (LMs) hold significant potential for clinical prediction tasks over electronic health records (EHRs), and thus for improving healthcare delivery. However, in these high-stakes applications, unreliable decisions can be costly, compromising patient safety and raising ethical concerns, which increases the need for sound uncertainty modeling of automated clinical predictions. To address this, we study the uncertainty quantification of LMs on EHR tasks in both white-box and black-box settings. We first quantify uncertainty in white-box models, where model parameters and output logits are accessible, and show that the proposed multi-tasking and ensembling methods effectively reduce model uncertainty on EHR tasks. Building on this idea, we extend our approach to black-box settings, including popular proprietary LMs such as GPT-4. We validate our framework on longitudinal clinical data from more than 6,000 patients across ten clinical prediction tasks. Results show that ensembling methods and multi-task prediction prompts reduce uncertainty across different scenarios. These findings increase model transparency in both white-box and black-box settings, thus advancing reliable AI in healthcare.
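To make the ensembling idea concrete, below is a minimal illustrative sketch (not the paper's actual method) of how predictive uncertainty is commonly quantified for an ensemble: average the class probabilities of several ensemble members and measure the entropy of the averaged prediction. The member probabilities here are hypothetical values for a single binary clinical prediction.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical class probabilities from three ensemble members
# for one binary clinical outcome (e.g., readmission: yes/no).
members = [
    [0.9, 0.1],
    [0.7, 0.3],
    [0.8, 0.2],
]

# Ensemble prediction: average the members' probability vectors.
n_classes = len(members[0])
ensemble = [sum(m[c] for m in members) / len(members) for c in range(n_classes)]

# Predictive entropy of the ensemble serves as an uncertainty score:
# lower entropy means a more confident (less uncertain) prediction.
uncertainty = entropy(ensemble)
```

The same entropy-based score can be applied in black-box settings by sampling multiple responses (e.g., over repeated prompts) and treating the empirical answer frequencies as the ensemble distribution.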