Large language models (LLMs) are increasingly used to predict human behavior. We propose a measure for evaluating how much knowledge a pretrained LLM brings to such a prediction: its equivalent sample size, defined as the amount of task-specific data needed to match the predictive accuracy of the LLM. We estimate this measure by comparing the prediction error of a fixed LLM in a given domain to that of flexible machine learning models trained on increasing samples of domain-specific data. We further provide a statistical inference procedure by developing a new asymptotic theory for cross-validated prediction error. Finally, we apply this method to the Panel Study of Income Dynamics. We find that LLMs encode considerable predictive information for some economic variables but much less for others, suggesting that their value as substitutes for domain-specific data differs markedly across settings.
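The estimation idea can be sketched in a few lines. The snippet below is an illustrative toy, not the paper's estimator: it uses a synthetic linear task, ordinary least squares in place of flexible machine learning models, a single held-out split rather than cross-validation, and a made-up `llm_error` value standing in for the fixed LLM's measured prediction error. All names and numbers here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic prediction task standing in for a domain-specific variable.
n_train, n_test, d = 1500, 500, 5
beta = rng.normal(size=d)
X = rng.normal(size=(n_train + n_test, d))
y = X @ beta + rng.normal(size=n_train + n_test)

X_tr, y_tr = X[:n_train], y[:n_train]
X_te, y_te = X[n_train:], y[n_train:]

def heldout_mse(n):
    """Held-out MSE of a model fit on the first n domain-specific observations."""
    coef, *_ = np.linalg.lstsq(X_tr[:n], y_tr[:n], rcond=None)
    return float(np.mean((y_te - X_te @ coef) ** 2))

# Learning curve: prediction error as a function of the training sample size.
sample_sizes = [25, 50, 100, 200, 400, 800, 1500]
learning_curve = {n: heldout_mse(n) for n in sample_sizes}

# Hypothetical held-out error of the fixed, pretrained LLM on the same task.
llm_error = 1.3

# Equivalent sample size: the smallest n at which a model trained on
# domain-specific data matches the LLM's predictive accuracy.
equivalent_n = next((n for n in sample_sizes if learning_curve[n] <= llm_error), None)
```

A small `equivalent_n` would indicate the LLM encodes little predictive information beyond what a modest sample provides; a large one, that the LLM substitutes for substantial domain-specific data.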