We consider a setting with $N$ heterogeneous units and $p$ interventions. Our goal is to learn unit-specific potential outcomes for any combination of these $p$ interventions, i.e., $N \times 2^p$ causal parameters. Choosing combinations of interventions is a problem that arises naturally in many applications, such as factorial design experiments, recommendation engines (e.g., showing a set of movies that maximizes engagement for users), combination therapies in medicine, and selecting important features for ML models. Running $N \times 2^p$ experiments to estimate these parameters is infeasible as $N$ and $p$ grow. Further, observational data is likely to be confounded, i.e., whether a unit is seen under a combination is correlated with its potential outcome under that combination. To address these challenges, we propose a novel model that imposes latent structure across both units and combinations. We assume latent similarity across units (i.e., the potential outcomes matrix has rank $r$) and regularity in how combinations interact (i.e., the coefficients in the Fourier expansion of the potential outcomes are $s$-sparse). We establish identification of all causal parameters despite unobserved confounding. We propose an estimation procedure, Synthetic Combinations, and establish finite-sample consistency under precise conditions on the observation pattern. Our results imply that Synthetic Combinations consistently estimates unit-specific potential outcomes given $\text{poly}(r) \times (N + s^2p)$ observations. In comparison, previous methods that do not exploit structure across both units and combinations have sample complexity scaling as $\min(N \times s^2p, \ r \times (N + 2^p))$. We use Synthetic Combinations to propose a data-efficient experimental design mechanism for combinatorial causal inference. We corroborate our theoretical findings with numerical simulations.
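To make the sparsity assumption and the sample-complexity comparison concrete, the following is a minimal sketch for a single unit. It assumes the standard Boolean Fourier (parity) basis over combinations encoded as $\pm 1$ vectors, which the abstract does not spell out. The names `fourier_characters` and `sample_counts` are hypothetical, and `poly_r` is a stand-in chosen by the caller, because the abstract leaves the polynomial in $r$ unspecified. This is not the Synthetic Combinations estimator itself.

```python
import itertools
import numpy as np


def fourier_characters(p):
    """Build the 2^p x 2^p matrix of parity characters (assumed basis).

    Row i is a combination of p binary interventions; column j is a subset S
    of interventions. Entry (i, j) is the product over k in S of the +/-1
    encoding of intervention k in combination i.
    """
    combos = np.array(list(itertools.product([0, 1], repeat=p)))
    signs = 1 - 2 * combos  # encode 0 -> +1, 1 -> -1
    subsets = [c for k in range(p + 1)
               for c in itertools.combinations(range(p), k)]
    chi = np.ones((2 ** p, 2 ** p))
    for j, subset in enumerate(subsets):
        if subset:  # the empty subset keeps the all-ones column
            chi[:, j] = signs[:, list(subset)].prod(axis=1)
    return combos, subsets, chi


def sample_counts(N, p, s, r, poly_r):
    """Evaluate the abstract's scalings with all constants set to 1.

    poly_r is an illustrative value for poly(r), supplied by the caller.
    """
    proposed = poly_r * (N + s ** 2 * p)
    baseline = min(N * s ** 2 * p, r * (N + 2 ** p))
    exhaustive = N * 2 ** p
    return proposed, baseline, exhaustive


# One unit, p = 4 interventions, s = 3 nonzero Fourier coefficients.
p, s = 4, 3
combos, subsets, chi = fourier_characters(p)

alpha = np.zeros(2 ** p)
alpha[[0, 2, 7]] = [1.5, -2.0, 0.5]  # illustrative sparse coefficients

# Potential outcomes for all 2^p combinations are determined by alpha.
outcomes = chi @ alpha

# The characters are orthogonal, so the full outcome vector recovers alpha.
alpha_recovered = chi.T @ outcomes / 2 ** p

# Illustrative scale comparison; poly_r = r is an assumption, not a result.
proposed, baseline, exhaustive = sample_counts(N=1000, p=20, s=5, r=3, poly_r=3)
```

The point of the sketch is that all $2^p$ outcomes of a unit are pinned down by only $s$ coefficients, which is the structure the stated rates exploit. The counts returned by `sample_counts` ignore constants and logarithmic factors, so they indicate orders of magnitude only.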