This study evaluates how well Large Language Model (LLM)-based Subpopulation Representative Models (SRMs) generalize from empirical data, using in-context learning with data from the 2016 and 2020 American National Election Studies. We examine generalization across response variables and demographic subgroups. While conditioning on empirical data improves performance overall, the benefit of in-context learning varies considerably across demographics, sometimes hurting performance for one demographic while helping performance for others. The inequitable benefits of in-context learning for SRMs present a challenge for practitioners implementing SRMs and for decision-makers who might come to rely on them. Our work highlights the need for fine-grained benchmarks, collected from diverse subpopulations, that test not only fidelity but also generalization.