Social scientists are now using large language models to create "silicon samples": synthetic datasets intended to stand in for human respondents. However, producing these samples requires many analytic choices, including model selection, sampling parameters, prompt format, and the amount of demographic or contextual information provided. Across two studies, I examine whether these choices materially affect correspondence between silicon samples and human data. In Study 1, I generated 252 silicon-sample configurations for a controlled case study using two social-psychological scales, evaluating whether configurations recovered participant rankings, response distributions, and between-scale correlations. Configurations varied substantially across all three criteria, and configurations that performed well on one dimension often performed poorly on another. In Study 2, I extended this analysis to a published silicon-sample use case by re-examining Argyle et al.'s (2023) Study 3 using 66 alternative configurations. Correlations between human and silicon association structures differed substantially across configurations, from r = .23 to r = .84. Taken together, the results from these studies demonstrate that different defensible configuration choices can materially alter conclusions about the fidelity of silicon samples. I call for greater attention to the threat of analytic flexibility in using silicon samples and outline strategies that researchers may adopt to reduce this threat.