Recent text-based causal methods attempt to mitigate confounding bias by estimating proxies of confounding variables that are partially or imperfectly measured from unstructured text data. These approaches, however, assume that analysts have supervised confounder labels for a subset of the text instances, a requirement that data privacy or annotation costs can make infeasible. In this work, we address settings in which an important confounding variable is completely unobserved. We propose a new causal inference method that uses multiple instances of pre-treatment text data, infers two proxies by applying two zero-shot models to separate instances, and uses these proxies in the proximal g-formula. We prove that our text-based proxy method satisfies the identification conditions required by the proximal g-formula, while other seemingly reasonable proposals do not. We evaluate our method in synthetic and semi-synthetic settings and find that it produces estimates with low bias. To address the untestable assumptions associated with the proximal g-formula, we further propose an odds ratio falsification heuristic. This new combination of proximal causal inference and zero-shot classifiers expands the set of text-specific causal methods available to practitioners.
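To make the role of the two proxies concrete, the following is a minimal sketch of the proximal g-formula for the simplest case of one binary unobserved confounder U and two binary proxies W and Z. It is an illustration under stated assumptions, not the paper's implementation: the data-generating process, the function names `simulate`, `proximal_mean`, and `naive_ate`, and all numeric parameters are hypothetical, and the proxies are simulated directly rather than predicted by zero-shot models from text. The sketch solves for an outcome bridge function h(a, w) satisfying E[Y | A=a, Z=z] = Σ_w h(a, w) P(W=w | A=a, Z=z), then averages h over the marginal distribution of W.

```python
import random


def simulate(n, seed=0):
    """Draw (W, Z, A, Y) rows. U is generated but never returned (unobserved)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.random() < 0.5                          # unobserved confounder
        w = int(rng.random() < (0.85 if u else 0.15))   # proxy W depends on U only
        z = int(rng.random() < (0.80 if u else 0.20))   # proxy Z depends on U only
        a = int(rng.random() < (0.70 if u else 0.30))   # treatment, confounded by U
        y = 1.0 * a + 2.0 * u + rng.gauss(0.0, 1.0)     # assumed true effect of A is 1.0
        rows.append((w, z, a, y))
    return rows


def proximal_mean(rows, a):
    """Estimate E[Y^a] with the proximal g-formula for binary W and Z."""
    n = len(rows)
    p_w1 = sum(w for w, _, _, _ in rows) / n  # P(W = 1)
    p = []  # p[z] = P(W = 1 | A = a, Z = z)
    m = []  # m[z] = E[Y | A = a, Z = z]
    for z in (0, 1):
        cell = [(w, y) for w, zz, aa, y in rows if aa == a and zz == z]
        p.append(sum(w for w, _ in cell) / len(cell))
        m.append(sum(y for _, y in cell) / len(cell))
    # Solve m[z] = h0 * (1 - p[z]) + h1 * p[z] for the bridge function h(a, w).
    h1_minus_h0 = (m[1] - m[0]) / (p[1] - p[0])
    h0 = m[0] - h1_minus_h0 * p[0]
    # E[Y^a] = sum_w h(a, w) * P(W = w)
    return h0 + h1_minus_h0 * p_w1


def naive_ate(rows):
    """Unadjusted difference in mean outcomes between treated and untreated."""
    treated = [y for _, _, a, y in rows if a == 1]
    control = [y for _, _, a, y in rows if a == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


rows = simulate(200_000)
naive = naive_ate(rows)
proximal = proximal_mean(rows, 1) - proximal_mean(rows, 0)
print(f"naive ATE:    {naive:.3f}")
print(f"proximal ATE: {proximal:.3f}")
```

In this toy setup the unadjusted contrast is confounded by U, while the proximal estimate should land near the simulated effect of 1.0. The 2x2 system is solvable only when P(W=1 | A=a, Z=z) differs across z, that is, when Z carries information about U that is also reflected in W. With real text, W and Z would be replaced by the outputs of two zero-shot classifiers applied to separate pre-treatment text instances.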