In causal inference, matching is one of the most widely used methods for mimicking a randomized experiment with observational (non-experimental) data. Ideally, treated units are matched exactly with control units on the covariates, so that treatment is as-if randomly assigned within each matched set and valid randomization tests for treatment effects can be conducted as in a randomized experiment. In practice, however, matching is typically inexact, especially when there are continuous or many observed covariates or when unobserved covariates exist. Previous matched observational studies routinely conducted downstream randomization tests as if matching were exact, as long as the matched datasets satisfied some prespecified balance criteria or passed some balance tests. Recent studies have shown that this routine practice can lead to a highly inflated type-I error rate for randomization tests, especially when the sample size is large. To address this problem, we propose an iterative convex programming framework for randomization tests with inexactly matched datasets. Under some commonly used regularity conditions, we show that our approach produces valid randomization tests (i.e., tests that robustly control the type-I error rate) for any inexactly matched dataset, even when unobserved covariates exist. Our framework also allows the incorporation of flexible machine learning models to better extract information from covariate imbalance while robustly controlling the type-I error rate.
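For orientation, the routine practice described above (running a randomization test as if matching were exact) can be sketched for matched pairs as follows. This is only an illustration of the conventional baseline, not the iterative convex programming framework proposed here. The function name `paired_randomization_pvalue`, the sum-of-differences test statistic, and the toy outcomes are assumptions made for the example. The sketch assumes the sharp null of no treatment effect and that, within each pair, either unit was equally likely to be treated; the second assumption is exactly what fails under inexact matching.

```python
import itertools


def paired_randomization_pvalue(treated, control):
    """One-sided exact randomization p-value for matched pairs.

    Assumes (hypothetically) exact matching: within each pair, each unit
    receives treatment with probability 1/2. Under the sharp null of no
    effect, swapping treatment labels within a pair only flips the sign
    of that pair's treated-minus-control difference.
    """
    diffs = [t - c for t, c in zip(treated, control)]
    observed = sum(diffs)  # test statistic: sum of paired differences

    at_least_as_large = 0
    total = 0
    # Enumerate all 2^n within-pair assignments (feasible only for small n).
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        stat = sum(s * d for s, d in zip(signs, diffs))
        if stat >= observed - 1e-12:  # small tolerance for float ties
            at_least_as_large += 1
        total += 1
    return at_least_as_large / total


# Toy outcomes for four matched pairs (illustrative values only).
p_value = paired_randomization_pvalue([5, 6, 7, 8], [4, 4, 4, 4])
print(p_value)
```

When the within-pair treatment probabilities deviate from 1/2 because of residual covariate imbalance, the uniform enumeration above no longer describes the true assignment distribution, which is the source of the type-I error inflation that the abstract refers to.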