Partial Identification under Missing Data Using Weak Shadow Variables from Pretrained Models

Estimating population quantities such as mean outcomes from user feedback is fundamental to platform evaluation and social science, yet feedback is often missing not at random (MNAR): users with stronger opinions are more likely to respond, so standard estimators are biased and the estimand is not identified without additional assumptions. Existing approaches typically rely on strong parametric assumptions or on bespoke auxiliary variables that may be unavailable in practice. In this paper, we develop a partial identification framework in which sharp bounds on the estimand are obtained by solving a pair of linear programs, one minimizing and one maximizing the estimand, whose constraints encode the observed data structure. This formulation naturally incorporates outcome predictions from pretrained models, including large language models (LLMs), as additional linear constraints that tighten the feasible set. We call these predictions weak shadow variables: they satisfy a conditional independence assumption with respect to missingness but need not meet the completeness conditions required by classical shadow-variable methods. When the predictions are sufficiently informative, the bounds collapse to a point, recovering standard identification as a special case. To provide valid coverage of the identified set in finite samples, we propose a set-expansion estimator that achieves a slower-than-$\sqrt{n}$ convergence rate in the set-identified regime and the standard $\sqrt{n}$ rate under point identification. In simulations and semi-synthetic experiments on customer-service dialogues, we find that LLM predictions are often ill-conditioned for classical shadow-variable methods yet remain highly effective in our framework: they shrink identification intervals by 75--83\% while maintaining valid coverage under realistic MNAR mechanisms.
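
To make the linear-programming idea concrete, the following is a minimal population-level sketch and not the paper's exact formulation. It assumes a three-level outcome Y in {0, 1, 2}, a binary prediction Z that is independent of the response indicator R given Y, and made-up toy probabilities. The function name `mean_bounds` and all of its arguments are hypothetical. Because the problem is tiny, the two linear programs are solved by enumerating the vertices of the feasible set rather than by calling a solver; finite-sample uncertainty and the set-expansion step are ignored.

```python
from itertools import combinations


def mean_bounds(p_obs, f_z_given_y, m_z, values):
    """Bounds on E[Y] from a toy LP, solved by vertex enumeration.

    Unknowns:    t[y] = P(Y=y, R=0) >= 0 (mass of each outcome among non-respondents).
    Constraints: sum_y f(z|y) * t[y] = P(Z=z, R=0) for z = 0, 1,
                 which follows from assuming Z is independent of R given Y.
    Objective:   sum_y y * (P(Y=y, R=1) + t[y]), minimized and maximized.

    p_obs[y]          : P(Y=y, R=1), observed among respondents.
    f_z_given_y[y][z] : P(Z=z | Y=y), estimable from respondents under the assumption.
    m_z[z]            : P(Z=z, R=0), observed because Z is never missing.
    """
    observed_part = sum(v * p for v, p in zip(values, p_obs))
    candidates = []
    # With two equality constraints, each vertex has at most two nonzero t[y],
    # so we solve a 2x2 system for every pair of outcome levels (Cramer's rule).
    for i, j in combinations(range(len(values)), 2):
        a, b = f_z_given_y[i][0], f_z_given_y[j][0]
        c, d = f_z_given_y[i][1], f_z_given_y[j][1]
        det = a * d - b * c
        if abs(det) < 1e-12:
            continue  # Z does not distinguish these two outcome levels
        t_i = (m_z[0] * d - b * m_z[1]) / det
        t_j = (a * m_z[1] - c * m_z[0]) / det
        if t_i >= -1e-12 and t_j >= -1e-12:  # keep feasible vertices only
            candidates.append(observed_part + values[i] * t_i + values[j] * t_j)
    if not candidates:
        raise ValueError("no feasible vertex: Z is uninformative or inputs are inconsistent")
    return min(candidates), max(candidates)


# Toy numbers (assumed): true P(Y) = (0.3, 0.4, 0.3), response rates (0.8, 0.5, 0.8),
# so the true mean is 1.0 and extreme opinions respond more often.
values = [0, 1, 2]
p_obs = [0.24, 0.20, 0.24]                       # P(Y=y, R=1)
f_z_given_y = [[0.8, 0.2], [0.4, 0.6], [0.2, 0.8]]  # rows: y, columns: z
m_z = [0.14, 0.18]                               # P(Z=z, R=0)

lower, upper = mean_bounds(p_obs, f_z_given_y, m_z, values)

# Worst-case bounds without Z: put all missing mass at the smallest or largest value.
p_missing = sum(m_z)
no_z_lower = sum(v * p for v, p in zip(values, p_obs)) + min(values) * p_missing
no_z_upper = sum(v * p for v, p in zip(values, p_obs)) + max(values) * p_missing
print(f"with Z:    [{lower:.3f}, {upper:.3f}]")
print(f"without Z: [{no_z_lower:.3f}, {no_z_upper:.3f}]")
```

In this toy setting the binary prediction cannot point-identify the mean of a three-level outcome, yet the interval with the prediction constraints is much narrower than the worst-case interval and still contains the true mean of 1.0. For larger outcome or prediction spaces, a general LP solver such as `scipy.optimize.linprog` would replace the vertex enumeration.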
