Given an observational study with $n$ independent but heterogeneous units, our goal is to learn the counterfactual distribution for each unit using only one $p$-dimensional sample per unit containing covariates, interventions, and outcomes. Specifically, we allow for unobserved confounding that both introduces statistical biases between interventions and outcomes and exacerbates the heterogeneity across units. Modeling the conditional distribution of the outcomes as an exponential family, we reduce learning the unit-level counterfactual distributions to learning $n$ exponential family distributions with heterogeneous parameters and only one sample per distribution. We introduce a convex objective that pools all $n$ samples to jointly learn all $n$ parameter vectors, and provide a unit-wise mean squared error bound that scales linearly with the metric entropy of the parameter space. For example, when the parameters are $s$-sparse linear combinations of $k$ known vectors, the error is $O(s\log k/p)$. En route, we derive sufficient conditions for compactly supported distributions to satisfy the logarithmic Sobolev inequality. As an application of the framework, our results enable consistent imputation of sparsely missing covariates.