We seek to democratise public-opinion research by providing practitioners with a general methodology for making representative inferences from cheap, high-frequency, highly unrepresentative samples. We focus specifically on samples that are readily available in moderate sizes. To this end, we make two major contributions. 1) We introduce a general sample-selection process, which we name online selection, and show that it is a special case of selection on the dependent variable. We improve multilevel regression and post-stratification (MrP) for severely biased samples by adding a bias-correction term in the style of King and Zeng to the logistic-regression framework. We show that this bias-corrected model outperforms traditional MrP under online selection and achieves performance similar to random sampling in a vast array of scenarios. 2) We present a protocol for using Large Language Models (LLMs) to extract structured, survey-like data from social media. We provide a prompt style that can be easily adapted to a variety of survey designs, and we show that LLMs agree with human raters on the demographic, socio-economic and political characteristics of these online users. The end-to-end implementation takes unrepresentative, unstructured social-media data as input and produces timely, high-quality area-level estimates as output. This is Artificially Intelligent Opinion Polling. We show that our AI polling estimates of the 2020 US presidential election are highly accurate, on par with estimates produced by state-level polling aggregators such as FiveThirtyEight and with those from MrP models fit to extremely expensive high-quality samples.