Using Large Language Model Annotations for Valid Downstream Statistical Inference in Social Science: Design-Based Semi-Supervised Learning

In computational social science (CSS), researchers analyze documents to explain social and political phenomena. Typically, CSS researchers first obtain labels for documents and then, in a second step, explain those labels using interpretable regression analyses. Recent advances in large language models (LLMs) can lower costs for CSS research by annotating documents cheaply at scale, but such surrogate labels are often imperfect and biased. We present a new algorithm for using outputs from LLMs in downstream statistical analyses while guaranteeing statistical properties, such as asymptotic unbiasedness and proper uncertainty quantification, that are fundamental to CSS research. We show that direct use of LLM-predicted surrogate labels in downstream statistical analyses leads to substantial bias and invalid confidence intervals, even when surrogate accuracy is as high as 80-90%. To address this, we build on debiased machine learning to propose the design-based semi-supervised learning (DSL) estimator. DSL employs a doubly robust procedure to combine surrogate labels with a smaller number of gold-standard labels. Because the researcher controls the probability with which documents are sampled for gold-standard labeling, our approach guarantees valid inference for downstream statistical analyses without stringent assumptions, even when surrogates are arbitrarily biased. Both our theoretical analysis and experimental results show that DSL provides valid statistical inference while achieving root mean squared errors comparable to existing alternatives that focus only on prediction without statistical guarantees.
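The doubly robust combination of surrogate and gold-standard labels can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses the raw surrogate label as the prediction (the full method would fit a prediction model, typically with cross-fitting), assumes a constant, known sampling probability `pi`, uses a linear probability model as the downstream regression, and reports point estimates only. All names (`dsl_pseudo_outcome`, `ols`, the simulated error rates) are hypothetical.

```python
import numpy as np


def dsl_pseudo_outcome(surrogate, gold, labeled, pi):
    """Design-adjusted outcome: surrogate - (R / pi) * (surrogate - gold).

    `labeled` (R) marks documents sampled for gold-standard labeling with
    known probability `pi`; `gold` is ignored wherever `labeled` is False.
    """
    gold_filled = np.where(labeled, gold, 0.0)  # placeholder; multiplied by R = 0
    return surrogate - (labeled / pi) * (surrogate - gold_filled)


def ols(X, y):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=n)                      # document-level covariate
X = np.column_stack([np.ones(n), x])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-x))).astype(float)  # true labels

# Surrogate labels: about 85% accurate overall, but the error rate depends
# on x, so the labeling error is correlated with the covariate.
flip_prob = np.where(x > 0, 0.25, 0.05)
flip = rng.random(n) < flip_prob
q = np.where(flip, 1.0 - y, y)

# Gold-standard labels are observed only for a random 10% of documents.
pi = 0.1
labeled = rng.random(n) < pi
gold = np.where(labeled, y, np.nan)

beta_oracle = ols(X, y)                     # infeasible: uses every true label
beta_naive = ols(X, q)                      # treats surrogates as truth
y_tilde = dsl_pseudo_outcome(q, gold, labeled, pi)
beta_dsl = ols(X, y_tilde)                  # design-based correction

print("oracle slope:", beta_oracle[1])
print("naive slope: ", beta_naive[1])
print("DSL slope:   ", beta_dsl[1])
```

In this simulated setting the naive slope is pulled away from the oracle slope because the surrogate errors vary with `x`, while the corrected slope stays close to it at the cost of higher variance. The correction relies only on `pi` being known by design, not on the surrogate being accurate.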
