Assessing treatment effects in observational data with missing confounders: A comparative study of practical doubly-robust and traditional missing data methods

Weight · Maximum likelihood · Likelihood · Estimate/Estimator · Performer

December 19, 2024


Brian D. Williamson, Chloe Krakauer, Eric Johnson, Susan Gruber, Bryan E. Shepherd, Mark J. van der Laan, Thomas Lumley, Hana Lee, Jose J. Hernandez Munoz, Fengyu Zhao, Sarah K. Dutcher, Rishi Desai, Gregory E. Simon, Susan M. Shortreed, Jennifer C. Nelson, Pamela A. Shaw

From arXiv: 142 pages (27 main, 115 supplemental); 6 figures, 2 tables

In pharmacoepidemiology, safety and effectiveness are frequently evaluated using readily available administrative and electronic health records data. In these settings, detailed confounder data are often not available in all data sources and are therefore missing for a subset of individuals. Multiple imputation (MI) and inverse-probability weighting (IPW) are go-to analytical methods to handle missing data and are dominant in the biomedical literature. Doubly-robust methods, which are consistent under fewer assumptions, can be more efficient with respect to mean-squared error. We discuss two practical-to-implement doubly-robust estimators, generalized raking and inverse probability-weighted targeted maximum likelihood estimation (TMLE), which are both currently under-utilized in biomedical studies. We compare their performance to IPW and MI in a detailed numerical study for a variety of synthetic data-generating and missingness scenarios, including scenarios with rare outcomes and a high missingness proportion. Further, we consider plasmode simulation studies that emulate the complex data structure of a large electronic health records cohort in order to compare antidepressant therapies in a rare-outcome setting where a key confounder is prone to more than 50% missingness. We provide guidance on selecting a missing data analysis approach, based on which methods excelled with respect to the bias-variance trade-off across the different scenarios studied.

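To make the idea behind inverse-probability weighting (IPW) for missing data concrete, the following is a minimal sketch and not the estimator studied in the paper. It assumes a toy setting invented for illustration: a fully observed binary covariate `z`, a variable `x` that is missing at random given `z`, and the mean of `x` as the target rather than a treatment effect. All variable names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical toy data: z is always observed; x depends on z.
z = rng.binomial(1, 0.5, size=n)
x = rng.normal(loc=2.0 * z, scale=1.0, size=n)  # true mean of x is 1.0

# x is observed with a probability that depends only on z
# (missing at random given z); r flags the observed rows.
p_obs = np.where(z == 1, 0.3, 0.8)
r = rng.binomial(1, p_obs).astype(bool)

# Complete-case mean: biased here, because rows with z == 1
# (which have larger x) are under-represented among observed rows.
cc_mean = x[r].mean()

# Estimate the observation probability within each stratum of z,
# then map the stratum estimates back to individuals.
pi_hat = np.array([r[z == k].mean() for k in (0, 1)])[z]

# Normalized (Hajek-type) IPW mean: observed rows are weighted by
# 1 / pi_hat; unobserved rows get weight zero and never contribute x.
w = r / pi_hat
ipw_mean = np.sum(w * np.where(r, x, 0.0)) / np.sum(w)

print(f"complete-case mean: {cc_mean:.3f}")
print(f"IPW mean:           {ipw_mean:.3f}")
```

In this toy setup the weighted estimate should land near the true mean of 1.0, while the complete-case mean is pulled toward the over-represented stratum. The doubly-robust approaches named in the abstract (generalized raking and IPW-TMLE) build on weights of this kind but are not shown here.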