Estimating causal treatment effects in observational settings is frequently compromised by selection bias arising from unobserved confounders. While traditional econometric methods struggle when these confounders are orthogonal to structured covariates, high-dimensional unstructured text often contains rich proxies for these latent variables. This study proposes a Neural Network-Enhanced Double Machine Learning (DML) framework designed to leverage text embeddings for causal identification. Using a rigorous synthetic benchmark, we demonstrate that unstructured text embeddings capture critical confounding information that is absent from structured tabular data. However, we show that standard tree-based DML estimators retain substantial bias (+24%) because they cannot model the continuous topology of embedding manifolds. In contrast, our deep learning approach, with optimized architectures, reduces bias to -0.86%, effectively recovering the ground-truth causal parameter. These findings suggest that deep learning architectures are essential for satisfying the unconfoundedness assumption when conditioning on high-dimensional natural language data.
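To make the partialling-out idea behind DML concrete, the following is a minimal sketch, not the framework proposed in this study. It assumes a partially linear model, a hypothetical synthetic data generator (`simulate`), a confounder that is linear in a 16-dimensional stand-in "embedding", and closed-form ridge regression as the nuisance learner in place of a neural network. All function names, dimensions, and coefficients are illustrative assumptions.

```python
import numpy as np


def ridge_fit_predict(X_train, y_train, X_test, alpha=1.0):
    """Closed-form ridge regression with intercept (stand-in nuisance learner)."""
    mu_x = X_train.mean(axis=0)
    mu_y = y_train.mean()
    Xc = X_train - mu_x
    A = Xc.T @ Xc + alpha * np.eye(Xc.shape[1])
    w = np.linalg.solve(A, Xc.T @ (y_train - mu_y))
    return (X_test - mu_x) @ w + mu_y


def dml_plr(Y, T, X, n_folds=5, alpha=1.0, seed=0):
    """Cross-fitted partialling-out estimate of theta in Y = theta*T + g(X) + eps."""
    n = len(Y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    y_res = np.empty(n)
    t_res = np.empty(n)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Nuisance models are fit on the other folds, then residualize this fold.
        y_res[test_idx] = Y[test_idx] - ridge_fit_predict(X[train], Y[train], X[test_idx], alpha)
        t_res[test_idx] = T[test_idx] - ridge_fit_predict(X[train], T[train], X[test_idx], alpha)
    # Regress outcome residuals on treatment residuals.
    return (t_res @ y_res) / (t_res @ t_res)


def simulate(n=4000, d=16, theta=2.0, seed=0):
    """Hypothetical benchmark: the confounder U is visible only through the embedding E."""
    rng = np.random.default_rng(seed)
    E = rng.normal(size=(n, d))          # stand-in for a text embedding
    beta = rng.normal(size=d)
    beta /= np.linalg.norm(beta)         # so that Var(U) = 1
    U = E @ beta                         # latent confounder (linear in E by assumption)
    S = rng.normal(size=(n, 3))          # structured covariates, independent of U
    T = U + rng.normal(size=n)
    Y = theta * T + 2.0 * U + rng.normal(size=n)
    return Y, T, S, E


Y, T, S, E = simulate()
theta_struct = dml_plr(Y, T, S)                   # conditions on tabular covariates only
theta_text = dml_plr(Y, T, np.hstack([S, E]))     # also conditions on the embedding
print(f"structured only: {theta_struct:.3f}")
print(f"with embedding:  {theta_text:.3f}")
```

In this toy setting the structured-only estimate stays confounded, while adding the embedding lets the residualization absorb the latent variable. A linear learner suffices here only because the confounder was constructed to be linear in the embedding. The abstract's claim concerns the harder case where that relationship is nonlinear and the choice of nuisance learner matters.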