The advent of AI-driven large language models (LLMs) has stirred discussions about their role in qualitative research. Some view these models as tools to enrich human understanding, while others perceive them as threats to the core values of the discipline. This study aimed to compare the comprehension capabilities of humans and LLMs. We conducted an experiment with a small sample of Alexa app reviews, initially classified by a human analyst. LLMs were then asked to classify these reviews and provide the reasoning behind each classification. We compared the results with the human classification and reasoning. The research indicated a significant alignment between human and ChatGPT 3.5 classifications in one third of cases, and a slightly lower alignment between human and GPT4 classifications, in over a quarter of cases. The two AI models aligned more closely with each other, agreeing in more than half of the instances. However, a consensus across all three methods was seen in only about one fifth of the classifications. In the comparison of human and LLM reasoning, it appears that human analysts lean heavily on their individual experiences. As expected, LLMs, on the other hand, base their reasoning on the specific word choices found in app reviews and the functional components of the app itself. Our results highlight the potential for effective human-LLM collaboration, suggesting a synergistic rather than competitive relationship. Researchers must continuously evaluate LLMs' role in their work, thereby fostering a future where AI and humans jointly enrich qualitative research.