Today, using Large-scale generative Language Models (LLMs), it is possible to simulate free responses to interview questions like those traditionally analyzed using qualitative research methods. Qualitative methodology encompasses a broad family of techniques involving manual analysis of open-ended interviews or conversations conducted freely in natural language. Here we consider whether artificial "silicon participants" generated by LLMs may be productively studied using qualitative methods, with the aim of producing insights that could generalize to real human populations. The key concept in our analysis is algorithmic fidelity, a term introduced by Argyle et al. (2023) to capture the degree to which LLM-generated outputs mirror the beliefs and attitudes of human sub-populations. By definition, high algorithmic fidelity suggests that latent beliefs elicited from LLMs may generalize to real humans, whereas low algorithmic fidelity renders such research invalid. We used an LLM to generate interviews with silicon participants whose demographic characteristics matched, one-for-one, those of a set of human participants. Using framework-based qualitative analysis, we showed that the key themes obtained from human and silicon participants were strikingly similar. However, when we analyzed the structure and tone of the interviews, we found even more striking differences. We also found evidence of the hyper-accuracy distortion described by Aher et al. (2023). We conclude that the LLM we tested (GPT-3.5) does not have sufficient algorithmic fidelity for research on it to be expected to generalize to human populations. However, the rapid pace of LLM research makes it plausible that this could change in the future. We therefore stress the need to establish epistemic norms now for assessing the validity of LLM-based qualitative research, especially concerning the need to ensure representation of heterogeneous lived experiences.