Generative Augmented Inference

Data-driven operations management often relies on parameters estimated from costly human-generated labels. Recent advances in large language models (LLMs) and other AI systems offer inexpensive auxiliary data but introduce a new challenge: AI outputs are not direct observations of the target outcomes, and may instead be high-dimensional representations with complex and unknown relationships to human labels. Conventional methods treat AI predictions as direct proxies for true labels, which can be inefficient or unreliable when this relationship is weak or misspecified. We propose Generative Augmented Inference (GAI), a general framework that incorporates AI-generated outputs as informative features for estimating models of human-labeled outcomes. GAI uses an orthogonal moment construction that enables consistent estimation and valid inference while allowing a flexible, nonparametric relationship between LLM-generated outputs and human labels. We establish asymptotic normality and show a "safe default" property: relative to human-data-only estimators, GAI weakly improves estimation efficiency under arbitrary auxiliary signals and yields strict gains whenever the auxiliary information is predictive. Empirically, GAI outperforms benchmarks across diverse settings. In conjoint analysis with weak auxiliary signals, GAI reduces estimation error by about 50% and lowers human labeling requirements by over 75%. In retail pricing, where all methods access the same auxiliary inputs, GAI consistently outperforms alternative estimators, which highlights the value of its construction rather than differences in information. In health insurance choice, it cuts labeling requirements by over 90% while maintaining decision accuracy. Across applications, GAI improves confidence interval coverage without inflating width. Overall, GAI provides a principled and scalable approach to integrating AI-generated information.
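To make the idea of using AI outputs as features rather than as proxy labels more concrete, the following is a minimal sketch for the simplest possible target, a population mean. It is not the GAI estimator from the paper, which covers general model parameters and nonparametric relationships. The function name `augmented_mean`, the simulated data, and the use of ordinary least squares as the stand-in learner are all assumptions made for illustration. The sketch fits a predictor of the human label from AI features on one fold of the labeled data, averages its predictions over a large unlabeled pool, and corrects that average with the held-out residuals, so that a poor predictor adds noise but not bias.

```python
import numpy as np


def augmented_mean(y_lab, x_lab, x_unlab, n_folds=2, seed=0):
    """Cross-fitted, prediction-augmented estimate of E[Y] (illustrative only).

    y_lab   : human labels, shape (n,)
    x_lab   : AI-generated features for the labeled units, shape (n, d)
    x_unlab : AI-generated features for a large unlabeled pool, shape (N, d)
    """
    rng = np.random.default_rng(seed)
    n = len(y_lab)
    folds = rng.permutation(n) % n_folds  # balanced random fold assignment

    def with_const(x):
        # Prepend an intercept column to the feature matrix.
        return np.column_stack([np.ones(len(x)), x])

    estimates = []
    for k in range(n_folds):
        train, held = folds != k, folds == k
        # Stand-in learner (assumption): least squares fit on the other folds.
        beta, *_ = np.linalg.lstsq(with_const(x_lab[train]), y_lab[train], rcond=None)
        # Average prediction over the unlabeled pool ...
        pred_mean = (with_const(x_unlab) @ beta).mean()
        # ... plus the mean held-out residual, which removes the predictor's bias.
        resid_mean = (y_lab[held] - with_const(x_lab[held]) @ beta).mean()
        estimates.append(pred_mean + resid_mean)
    return float(np.mean(estimates))


# Simulated example (all values hypothetical): 200 labeled, 20000 unlabeled units.
rng = np.random.default_rng(1)
w = np.array([1.0, -1.0, 0.5, 0.0, 0.0])
x_lab = rng.normal(size=(200, 5))
x_unlab = rng.normal(size=(20000, 5))
y_lab = 2.0 + x_lab @ w + rng.normal(scale=0.5, size=200)  # true mean is 2.0

print("human-only mean:", y_lab.mean())
print("augmented mean :", augmented_mean(y_lab, x_lab, x_unlab))
```

In this sketch the residual correction keeps the estimate centered on the true mean even when the fitted predictor is poor, while an informative predictor shrinks the variance. The paper's "safe default" guarantee refers to its own construction and is not established by this simplified example.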
