Application of Propensity Score Models and Causal Estimators in Observational Studies under Model Misspecification

Propensity score (PS) methods are widely used in observational studies to reduce confounding and estimate causal treatment effects. However, the validity of PS-based causal estimators depends heavily on correct model specification, and misspecification can lead to substantial bias and instability. In this study, we systematically evaluate three commonly used causal estimators, response surface modeling (RSM), inverse probability weighting (IPW), and augmented inverse probability weighting (AIPW), under varying levels of PS and outcome model misspecification. For PS estimation, we compare classical logistic regression with random forests (RF), support vector machines (SVM), and linear discriminant analysis (LDA). We conduct extensive simulation studies across scenarios defined by combinations of correctly specified and misspecified PS and outcome models, varying sample sizes, and different covariate correlation structures. We assess estimator performance using bias, absolute bias, root mean squared error, empirical standard error, and confidence interval width. The results show that AIPW provides robust and stable estimates across most scenarios owing to its doubly robust property, whereas IPW is highly sensitive both to PS misspecification and to the unstable PS estimates that flexible machine learning methods can produce. RSM performs well only when the outcome model is correctly specified. Applications to the ACTG175 clinical trial and the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset further illustrate the practical implications of estimator choice and PS modeling strategy. Overall, our findings highlight the importance of integrating flexible machine learning approaches within doubly robust frameworks to improve causal effect estimation in observational studies.
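
To make the contrast between the three estimators concrete, the following is a minimal sketch of their textbook forms for the average treatment effect. It assumes the fitted propensity scores `e` and the outcome-model predictions `m1`, `m0` are supplied by whatever models the analyst chooses (logistic regression, RF, SVM, LDA, and so on). The function names, the toy data, and the use of unnormalized weights are illustrative assumptions, not the study's actual implementation.

```python
from statistics import fmean


def rsm_ate(m1, m0):
    """Response surface (outcome regression) estimate: mean of m1(X) - m0(X)."""
    return fmean(a - b for a, b in zip(m1, m0))


def ipw_ate(y, t, e):
    """Unnormalized IPW estimate; relies entirely on the propensity scores e."""
    return fmean(
        ti * yi / ei - (1 - ti) * yi / (1 - ei)
        for yi, ti, ei in zip(y, t, e)
    )


def aipw_ate(y, t, e, m1, m0):
    """AIPW estimate: outcome-model contrast plus weighted residual correction."""
    return fmean(
        (a - b) + ti * (yi - a) / ei - (1 - ti) * (yi - b) / (1 - ei)
        for yi, ti, ei, a, b in zip(y, t, e, m1, m0)
    )


# Hypothetical toy data: the outcome model is exactly right (y equals m1 or m0
# depending on treatment), and the true effect is 2 for every unit.
t = [1, 0, 1, 0]
m1 = [3.0, 4.0, 5.0, 6.0]
m0 = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 2.0, 5.0, 4.0]

# Deliberately misspecified propensity scores (assumed values for illustration).
e_wrong = [0.9, 0.2, 0.3, 0.6]

print("RSM :", rsm_ate(m1, m0))
print("IPW :", ipw_ate(y, t, e_wrong))
print("AIPW:", aipw_ate(y, t, e_wrong, m1, m0))
```

In this toy case the outcome model is correct, so the AIPW residual terms vanish and the estimate recovers the true effect even though the propensity scores are wrong, while IPW with the same scores does not. This is one half of the doubly robust property the abstract refers to. The other half (correct PS, wrong outcome model) holds in expectation rather than exactly in a sample this small.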