Consider the problem of estimating the causal effect of some attribute of a text document; for example: what effect does writing a polite vs. rude email have on response time? To estimate a causal effect from observational data, we need to adjust for confounding aspects of the text that affect both the treatment and the outcome (e.g., the topic or writing level of the text). These confounding aspects are unknown a priori, so it seems natural to adjust for the entirety of the text (e.g., using a transformer). However, causal identification and estimation procedures rely on the assumption of overlap: at every level of the adjustment variables, enough randomness must remain that each unit could have either received or not received treatment. Since the treatment here is itself an attribute of the text, it is perfectly determined by the text, and overlap is apparently violated. The purpose of this paper is to show how to handle causal identification and obtain robust causal estimation in the presence of apparent overlap violations. In brief, the idea is to use supervised representation learning to produce a data representation that preserves confounding information while eliminating information that is only predictive of the treatment. This representation then suffices for adjustment and can satisfy overlap. Adapting results on non-parametric estimation, we find that this procedure is robust to misestimation of the conditional outcome, yielding a low-bias estimator with valid uncertainty quantification under weak conditions. Empirical results show strong improvements in bias and uncertainty quantification relative to the natural baseline.
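To make the idea concrete, the following is a minimal sketch on simulated data, not the paper's implementation. It assumes a two-feature stand-in for the text, a linear outcome model in place of a transformer, and a standard augmented inverse-probability-weighted (AIPW) estimator as the robust final step; every function name, the simulated data-generating process, and the clipping threshold are hypothetical choices made for illustration. The treatment is a deterministic function of the full "text", so overlap fails when adjusting for everything, but the propensity fit on the outcome model's predictions stays away from 0 and 1.

```python
import numpy as np


def simulate(n=4000, tau=1.0, seed=0):
    """Toy data (assumed): the 'text' x has a confounding part and a treatment-only part."""
    rng = np.random.default_rng(seed)
    c = rng.normal(size=n)              # confounding part (e.g. topic): affects treatment and outcome
    w = rng.normal(scale=2.0, size=n)   # part that is only predictive of the treatment
    a = (c + w > 0).astype(float)       # treatment is fully determined by the text (c, w)
    y = tau * a + 2.0 * c + rng.normal(size=n)
    return np.column_stack([c, w]), a, y


def fit_outcome_model(x, a, y):
    """Least-squares stand-in for a learned conditional outcome model Q(a, x)."""
    design = np.column_stack([np.ones(len(a)), a, x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)

    def q(a_value, x_new):
        n_new = len(x_new)
        d = np.column_stack([np.ones(n_new), np.full(n_new, a_value), x_new])
        return d @ coef

    return q


def fit_logistic(feature, a, n_iter=25):
    """One-feature logistic regression fitted by Newton's method; returns fitted probabilities."""
    z = (feature - feature.mean()) / feature.std()
    design = np.column_stack([np.ones(len(z)), z])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(design @ beta)))
        grad = design.T @ (a - p)
        hess = design.T @ (design * (p * (1.0 - p))[:, None])
        beta += np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-(design @ beta)))


def estimate_effect(x, a, y, clip=0.05):
    """AIPW estimate that adjusts for the outcome-model representation instead of raw x."""
    q = fit_outcome_model(x, a, y)
    q0, q1 = q(0.0, x), q(1.0, x)
    # Representation: the predicted outcomes. With this linear model q1 = q0 + constant,
    # so q0 alone carries the representation.
    g = np.clip(fit_logistic(q0, a), clip, 1.0 - clip)  # clipping is an assumed safeguard
    phi = q1 - q0 + a / g * (y - q1) - (1.0 - a) / (1.0 - g) * (y - q0)
    return phi.mean(), phi.std(ddof=1) / np.sqrt(len(y)), g


x, a, y = simulate()
tau_hat, se, g = estimate_effect(x, a, y)
print(f"estimate = {tau_hat:.3f}, standard error = {se:.3f}")
print(f"propensity range on the representation: [{g.min():.2f}, {g.max():.2f}]")
```

The simulated true effect is 1.0, so the printed estimate can be compared against it. The sketch fits and evaluates on the same sample; a more careful version would likely use sample splitting and a flexible outcome model.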