Social scientists are now using large language models to create "silicon samples": synthetic datasets intended to stand in for human respondents. However, producing these samples requires many analytic choices, including model selection, sampling parameters, prompt format, and the amount of demographic or contextual information provided. Across two studies, I examine whether these choices materially affect correspondence between silicon samples and human data. In Study 1, I generated 252 silicon-sample configurations for a controlled case study using two social-psychological scales, evaluating whether configurations recovered participant rankings, response distributions, and between-scale correlations. Configurations varied substantially across all three criteria, and configurations that performed well on one dimension often performed poorly on another. In Study 2, I extended this analysis to a published silicon-sample use case by re-examining Argyle et al.'s (2023) Study 3 using 66 alternative configurations. Correlations between human and silicon association structures differed substantially across configurations, from r = .23 to r = .84. Taken together, the results from these studies demonstrate that different defensible configuration choices can materially alter conclusions about the fidelity of silicon samples. I call for greater attention to the threat of analytic flexibility in using silicon samples and outline strategies that researchers may adopt to reduce this threat.