Consider the problem of estimating the causal effect of some attribute of a text document; for example, what effect does writing a polite vs. rude email have on response time? To estimate a causal effect from observational data, we need to adjust for confounding aspects of the text that affect both the treatment and the outcome, such as the topic or writing level of the text. These confounding aspects are unknown a priori, so it seems natural to adjust for the entirety of the text (e.g., using a transformer). However, causal identification and estimation procedures rely on the assumption of overlap: for all levels of the adjustment variables, there is randomness left over so that every unit could have received (or not received) treatment. Since the treatment here is itself an attribute of the text, it is perfectly determined by the text, and overlap is apparently violated. The purpose of this paper is to show how to handle causal identification and obtain robust causal estimation in the presence of apparent overlap violations. In brief, the idea is to use supervised representation learning to produce a data representation that preserves confounding information while eliminating information that is only predictive of the treatment. This representation then suffices for adjustment and can satisfy overlap. Adapting results on non-parametric estimation, we find that this procedure is robust to conditional outcome misestimation, yielding a low-bias estimator with valid uncertainty quantification under weak conditions. Empirical results show strong improvements in bias and uncertainty quantification relative to the natural baseline.
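
To make the idea concrete, the following is a minimal sketch of the procedure on synthetic data. It is not the authors' implementation, and every specific in it is an assumption made for illustration: the toy data-generating process, the linear outcome model standing in for a transformer, the use of the fitted outcome prediction as the learned representation, the augmented inverse-probability-weighted (AIPW) estimator as one standard doubly robust choice, the propensity clipping threshold, and all function names. Sample splitting (cross-fitting) is omitted for brevity.

```python
import numpy as np


def simulate(n=5000, tau=1.0, seed=0):
    """Toy data: c is a confounder (e.g. topic), s is a style feature of the text."""
    rng = np.random.default_rng(seed)
    c = rng.normal(size=n)
    s = 0.5 * c + rng.normal(size=n)
    # The treatment is a deterministic function of the text features (c, s),
    # so overlap fails when adjusting for the full text.
    a = (s > 0).astype(float)
    y = tau * a + 2.0 * c + rng.normal(size=n)
    return c, s, a, y


def fit_outcome_model(a, c, s, y):
    """Least-squares outcome model Q(a, x); a stand-in for a supervised text model."""
    design = np.column_stack([np.ones_like(a), a, c, s])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_outcome(coef, a_value, c, s):
    """Predicted outcome with the treatment set to a_value for every unit."""
    design = np.column_stack([np.ones_like(c), np.full_like(c, a_value), c, s])
    return design @ coef


def fit_logistic(feature, a, n_iter=25):
    """Propensity model on a one-dimensional representation, fit by Newton's method."""
    X = np.column_stack([np.ones_like(feature), feature])
    w = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (a - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        w += np.linalg.solve(hess + 1e-8 * np.eye(2), grad)
    return 1.0 / (1.0 + np.exp(-X @ w))


def run_demo(n=5000, tau=1.0, seed=0, clip=0.05):
    c, s, a, y = simulate(n, tau, seed)
    coef = fit_outcome_model(a, c, s, y)
    q0 = predict_outcome(coef, 0.0, c, s)
    q1 = predict_outcome(coef, 1.0, c, s)
    # Representation: the outcome prediction keeps what matters for y (here, c)
    # and largely drops s, which only predicts the treatment. In this linear
    # sketch q1 = q0 + constant, so q0 alone serves as the representation.
    g = np.clip(fit_logistic(q0, a), clip, 1.0 - clip)
    # AIPW influence values: plug-in contrast plus a propensity-weighted residual correction.
    psi = q1 - q0 + a * (y - q1) / g - (1.0 - a) * (y - q0) / (1.0 - g)
    return psi.mean(), psi.std(ddof=1) / np.sqrt(n), g


ate_hat, se_hat, g_hat = run_demo()
print(f"ATE estimate: {ate_hat:.3f} (standard error {se_hat:.3f})")
print(f"propensity range on the representation: [{g_hat.min():.2f}, {g_hat.max():.2f}]")
```

In this toy setup, a propensity model fit on the full features would be degenerate because `a` is an exact function of `s`, whereas the propensity fit on the outcome-based representation stays away from 0 and 1, which is what allows the weighted correction term to be computed.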