Currently, there are thousands of pretrained large language models (LLMs) available to social scientists. How do we select among them? Using validity, reliability, reproducibility, and replicability as guides, we explore the significance of: (1) model openness, (2) model footprint, (3) training data, and (4) model architectures and fine-tuning. While ex-ante tests of validity (i.e., benchmarks) are often privileged in these discussions, we argue that social scientists cannot altogether avoid validating computational measures ex-post. Replicability, in particular, is a more pressing guide for selecting language models: reliably replicating a finding that entails the use of a language model requires being able to reliably reproduce the underlying task. To this end, we propose starting with smaller, open models and constructing delimited benchmarks to demonstrate the validity of the entire computational pipeline.