Accurate predictions, such as those from machine learning, may not suffice to provide optimal healthcare for every patient. Indeed, predictions can be driven by shortcuts in the data, such as racial biases. Data-driven decisions require causal thinking. Here, we introduce its key elements, focusing on routinely collected data: electronic health records (EHRs) and claims data. Using such data to assess the value of an intervention requires care, because temporal dependencies and existing practices easily confound the causal effect. We present a step-by-step framework to help build valid decision-making from real-life patient records by emulating a randomized trial before individualizing decisions, e.g., with machine learning. Our framework highlights the most important pitfalls and considerations in analysing EHRs or claims data to draw causal conclusions. We illustrate the framework by studying the effect of albumin on sepsis mortality in the Medical Information Mart for Intensive Care database (MIMIC-IV), and we examine the impact of the choices made at every step, from feature extraction to causal-estimator selection. In a tutorial spirit, the code and the data are openly available.
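To make the confounding concern concrete, the following is a minimal sketch on purely synthetic data. It is not the authors' code and does not use MIMIC-IV. All names (`simulate`, `naive_difference`, `stratified_difference`) and all probabilities are hypothetical choices made for illustration. It assumes a single binary confounder (illness severity) that drives both treatment assignment and mortality, and a true treatment effect of a 0.05 reduction in mortality risk. Stratifying on the confounder is used here as one simple adjustment; it stands in for, and is much simpler than, the causal estimators an actual analysis would compare.

```python
import numpy as np

def simulate(n=200_000, seed=0):
    """Synthetic cohort in which sicker patients are treated more often."""
    rng = np.random.default_rng(seed)
    severe = rng.random(n) < 0.4               # confounder: illness severity
    p_treat = np.where(severe, 0.7, 0.2)       # existing practice: treat the sickest
    treated = rng.random(n) < p_treat
    # Assumed ground truth: treatment lowers mortality risk by 0.05 in every stratum.
    p_death = np.where(severe, 0.5, 0.1) - 0.05 * treated
    died = rng.random(n) < p_death
    return severe, treated, died

def naive_difference(treated, died):
    """Unadjusted difference in mortality between treated and untreated patients."""
    return died[treated].mean() - died[~treated].mean()

def stratified_difference(severe, treated, died):
    """Average the within-stratum differences, weighted by stratum size."""
    effect = 0.0
    for s in (False, True):
        mask = severe == s
        diff = died[mask & treated].mean() - died[mask & ~treated].mean()
        effect += mask.mean() * diff
    return effect

severe, treated, died = simulate()
naive = naive_difference(treated, died)
adjusted = stratified_difference(severe, treated, died)
print(f"naive: {naive:+.3f}  adjusted: {adjusted:+.3f}  assumed truth: -0.050")
```

In this simulation the unadjusted comparison makes the treatment look harmful, because treated patients were sicker to begin with, while the stratified estimate lands close to the assumed effect. Real records rarely offer a single, fully observed confounder, which is why each step from feature extraction to estimator choice needs scrutiny.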