Chatbots have shown promise as tools to scale qualitative data collection. Recent advances in Large Language Models (LLMs) could accelerate this process by allowing researchers to easily deploy sophisticated interviewing chatbots. We test this assumption in a large-scale user study (n=399) evaluating three different chatbots: two LLM-based and a baseline that employs hard-coded questions. We evaluate the results with respect to participant engagement and experience, established metrics of chatbot quality grounded in theories of effective communication, and a novel scale evaluating "richness," the extent to which responses capture the complexity and specificity of the social context under study. We find that, while the chatbots elicited high-quality responses by established evaluation metrics, the responses rarely captured participants' specific motives or personalized examples, and thus scored poorly on richness. We further find low inter-rater reliability between LLMs and humans in the assessment of both quality and richness metrics. Our study offers a cautionary tale for scaling and evaluating qualitative research with LLMs.