In this paper, we demonstrate how to enhance the validity of causal inference with unstructured high-dimensional treatments, such as texts, by leveraging the power of generative artificial intelligence. Specifically, we propose to use a deep generative model, such as a large language model (LLM), to efficiently generate treatments and to use its internal representation for subsequent causal effect estimation. We show that knowledge of this true internal representation helps separate the treatment features of interest, such as specific sentiments and certain topics, from other possibly unknown confounding features. Unlike existing methods, our proposed approach eliminates the need to learn a causal representation from the data and hence produces more accurate and efficient estimates. We formally establish the conditions required for nonparametric identification of the average treatment effect, propose an estimation strategy that avoids violation of the overlap assumption, and derive the asymptotic properties of the proposed estimator via double machine learning. Finally, using an instrumental variables approach, we extend the proposed methodology to settings in which the treatment feature is based on human perception rather than assumed to be fixed given the treatment object. We conduct simulation studies using text data generated with an open-source LLM, Llama3, to illustrate the advantages of our estimator over state-of-the-art causal representation learning algorithms.