Machine learning for early prediction in medicine has recently shown breakthrough performance; however, the focus on improving prediction accuracy has led to a neglect of the faithful explanations that are required to gain the trust of medical practitioners. The goal of this paper is to teach LLMs to follow medical consensus guidelines step by step in their reasoning and prediction process. Since consensus guidelines are ubiquitous in medicine, instantiations of verbalized medical inference rules on electronic health records provide data for fine-tuning LLMs to learn consensus rules, and possible exceptions thereof, for many medical areas. Consensus rules also enable an automatic evaluation of the model's inference process with respect to its derivation correctness (evaluating the correct and faithful deduction of a conclusion from given premises) and its value correctness (comparing predicted values against real-world measurements). We exemplify our work using the complex Sepsis-3 consensus definition. Our experiments show that small fine-tuned models outperform both one-shot learning of considerably larger LLMs that are prompted with the explicit definition and models that are trained on medical texts including consensus definitions. Since fine-tuning on verbalized rule instantiations of a specific medical area yields nearly perfect derivation correctness for rules (and exceptions) on unseen patient data in that area, the bottleneck for early prediction is not out-of-distribution generalization, but the orthogonal problem of generalization into the future, i.e., forecasting sparsely and irregularly sampled clinical variables. We show that these forecasting results can be improved by integrating the output representations of a time series forecasting model with the LLM in a multimodal setup.
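To make the two evaluation criteria concrete, the following minimal Python sketch instantiates the Sepsis-3 consensus rule (suspected infection together with an acute SOFA increase of at least 2 points over the patient's baseline) and the derivation- and value-correctness checks described above. This is an illustrative sketch, not the paper's implementation; all function and variable names (sepsis3_onset, derivation_correct, value_correct) are assumptions introduced here.

```python
# Illustrative sketch (not the paper's code) of the Sepsis-3 rule and the
# two automatic evaluation criteria named in the abstract.

def sepsis3_onset(suspected_infection: bool,
                  sofa_baseline: int, sofa_current: int) -> bool:
    """Sepsis-3: suspected infection together with an acute increase
    of >= 2 points in the SOFA score over the patient's baseline."""
    return suspected_infection and (sofa_current - sofa_baseline) >= 2

def derivation_correct(premises: dict, model_conclusion: bool) -> bool:
    """Derivation correctness: does the model's conclusion follow
    deductively from the premises it verbalized, taken at face value?"""
    return model_conclusion == sepsis3_onset(**premises)

def value_correct(predicted_value: float, measured_value: float,
                  tolerance: float = 0.0) -> bool:
    """Value correctness: does a predicted clinical variable agree with
    the real-world measurement (within an optional tolerance)?"""
    return abs(predicted_value - measured_value) <= tolerance

# A model that verbalizes SOFA 3 -> 6 under suspected infection and concludes
# "sepsis" is derivation-correct; whether SOFA 6 matches the patient's actual
# future measurement is the separate value-correctness question.
premises = {"suspected_infection": True, "sofa_baseline": 3, "sofa_current": 6}
assert derivation_correct(premises, model_conclusion=True)
assert value_correct(predicted_value=6, measured_value=6)
```

The separation matters for the abstract's main claim: a fine-tuned model can reach near-perfect derivation correctness (it applies the rule faithfully to whatever values it states) while value correctness remains limited by the difficulty of forecasting sparsely sampled clinical variables.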