Improving Bias Correction Standards by Quantifying its Effects on Treatment Outcomes

With growing access to administrative health databases, retrospective studies have become a crucial source of evidence on medical treatments. Yet non-randomized studies frequently suffer from selection bias and therefore require mitigation strategies. Propensity score matching (PSM) addresses this bias by selecting comparable populations, allowing for analysis without further methodological constraints. However, PSM has several drawbacks. Different matching methods can produce significantly different Average Treatment Effects (ATE) for the same task, even when they all meet the validation criteria. To prevent cherry-picking of the most favorable method, public authorities must involve field experts and engage in extensive discussions with researchers. To address this issue, we introduce a novel metric, A2A, that reduces the number of matches considered valid. A2A constructs artificial matching tasks that mirror the original one but have known outcomes, and uses them to assess each matching method's performance end to end, from propensity estimation to ATE estimation. When combined with the Standardized Mean Difference (SMD), A2A makes model selection more precise, reducing ATE estimation errors by up to 50% across synthetic tasks and the variability of predicted ATE by up to 90% across both synthetic and real-world datasets. To our knowledge, A2A is the first metric capable of evaluating outcome correction accuracy using covariates not involved in selection. Because computing A2A requires solving hundreds of PSMs, we automate all manual steps of the PSM pipeline. We integrate PSM methods from Python and R, our automated pipeline, the new metric, and reproducible experiments into popmatch, our new Python package, to improve the reproducibility and accessibility of bias correction methods.
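To make the quantities named above concrete, the following is a minimal sketch of one possible PSM step, using only the Python standard library. It is not the popmatch API and it does not implement A2A. The function names, the greedy 1:1 matching rule, the caliper value, and the toy data are all assumptions chosen for illustration, and the propensity scores are taken as given rather than estimated.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """SMD of one covariate: difference in means over the pooled standard deviation."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd


def greedy_match(ps_treated, ps_control, caliper=0.1):
    """Greedy 1:1 nearest-neighbour matching on propensity scores, without replacement.

    Hypothetical helper: one simple matching rule among the many a PSM pipeline may use.
    """
    available = set(range(len(ps_control)))
    pairs = []
    for i, p in enumerate(ps_treated):
        if not available:
            break
        # Closest remaining control unit in propensity score.
        j = min(available, key=lambda k: abs(ps_control[k] - p))
        if abs(ps_control[j] - p) <= caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs


def matched_effect(pairs, y_treated, y_control):
    """Mean outcome difference over matched pairs (an effect estimate on the matched sample)."""
    return mean(y_treated[i] - y_control[j] for i, j in pairs)


# Toy data (illustrative assumption): propensity scores, one covariate, one outcome.
ps_treated = [0.8, 0.6, 0.7]
ps_control = [0.2, 0.61, 0.79, 0.3, 0.72]
age_treated = [70, 60, 65]
age_control = [30, 61, 69, 35, 66]
y_treated = [5.0, 4.0, 4.5]
y_control = [1.0, 3.0, 4.0, 1.5, 3.5]

pairs = greedy_match(ps_treated, ps_control)
matched_age_control = [age_control[j] for _, j in pairs]

smd_before = standardized_mean_difference(age_treated, age_control)
smd_after = standardized_mean_difference(age_treated, matched_age_control)
effect = matched_effect(pairs, y_treated, y_control)

print(f"SMD before matching: {smd_before:.3f}")
print(f"SMD after matching:  {smd_after:.3f}")
print(f"Estimated effect on matched pairs: {effect:.2f}")
```

In this sketch the SMD serves only as a balance check on a single covariate. Because every treated unit is matched here, the pairwise difference is closer to an effect on the treated than to a population-wide ATE. Changing the matching rule or the caliper can change the matched pairs and therefore the estimate, which is the method-dependence the abstract describes.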