Consider a setting with $N$ heterogeneous units and $p$ interventions. Our goal is to learn unit-specific potential outcomes for any combination of these $p$ interventions, i.e., $N \times 2^p$ causal parameters. Choosing a combination of interventions is a problem that arises naturally in a variety of applications, such as factorial design experiments, recommendation engines, combination therapies in medicine, and conjoint analysis. Running $N \times 2^p$ experiments to estimate these parameters is likely to be expensive or infeasible as $N$ and $p$ grow. Further, observational data is likely to be confounded, i.e., whether or not a unit is seen under a combination is correlated with its potential outcome under that combination. To address these challenges, we propose a novel latent factor model that imposes structure both across units (i.e., the matrix of potential outcomes is approximately rank $r$) and across combinations of interventions (i.e., the coefficients in the Fourier expansion of the potential outcomes are approximately $s$-sparse). We establish identification for all $N \times 2^p$ parameters despite unobserved confounding. We propose an estimation procedure, Synthetic Combinations, and establish that it is finite-sample consistent and asymptotically normal under precise conditions on the observation pattern. Our results imply consistent estimation given $\text{poly}(r) \times \left( N + s^2p\right)$ observations, while previous methods have sample complexity scaling as $\min(N \times s^2p, \ \ \text{poly}(r) \times (N + 2^p))$. We use Synthetic Combinations to propose a data-efficient experimental design. Empirically, Synthetic Combinations outperforms competing approaches on a real-world dataset on movie recommendations. Lastly, we extend our analysis to causal inference where the intervention is a permutation over $p$ items (e.g., rankings).
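To make the sparsity assumption concrete, the following is a minimal sketch of a Fourier expansion over the $2^p$ combinations for a single unit. It assumes the standard Fourier basis for Boolean functions (parity characters $\chi_S(\pi) = (-1)^{\sum_{i \in S} \pi_i}$), which the abstract does not spell out. The names `fourier_characters`, `alpha`, and `outcomes`, and the values of `p` and `s`, are illustrative choices and not part of the proposed estimator.

```python
import itertools
import numpy as np


def fourier_characters(p):
    """Return all 2^p combinations and the 2^p x 2^p character matrix.

    Entry (i, j) is chi_{S_j}(pi_i) = (-1)^{<pi_i, S_j>}, where pi_i is the
    i-th combination and S_j is the j-th subset of interventions, both
    encoded as binary vectors of length p.
    """
    combos = np.array(list(itertools.product([0, 1], repeat=p)))
    parity = (combos @ combos.T) % 2   # parity of |pi ∩ S| for every pair
    chi = 1 - 2 * parity               # parity 0 -> +1, parity 1 -> -1
    return combos, chi


p, s = 4, 3                            # illustrative sizes (assumed)
rng = np.random.default_rng(0)
combos, chi = fourier_characters(p)

# One unit's s-sparse Fourier coefficients: only s of the 2^p are nonzero.
alpha = np.zeros(2 ** p)
support = rng.choice(2 ** p, size=s, replace=False)
alpha[support] = rng.normal(size=s)

# Potential outcomes under every combination: Y(pi) = sum_S alpha_S chi_S(pi).
outcomes = chi @ alpha

# The characters are orthogonal, so with all 2^p outcomes observed the
# coefficients are recovered exactly by the inverse transform.
alpha_recovered = chi.T @ outcomes / 2 ** p
```

In this toy setting every combination is observed, so recovery is exact. The point of the sparsity assumption is that the $2^p$ outcomes of a unit are determined by only $s$ coefficients, which is what allows estimation from far fewer than $2^p$ observations per unit.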