We show that the use of large language models (LLMs) is prevalent among crowd workers, and that targeted mitigation strategies can significantly reduce, but not eliminate, LLM use. On a text summarization task where workers were not directed in any way regarding their LLM use, the estimated prevalence of LLM use was around 30%, but was reduced by about half by asking workers to not use LLMs and by raising the cost of using them, e.g., by disabling copy-pasting. Secondary analyses give further insight into LLM use and its prevention: LLM use yields high-quality but homogeneous responses, which may harm research concerned with human (rather than model) behavior and degrade future models trained with crowdsourced data. At the same time, preventing LLM use may be at odds with obtaining high-quality responses; e.g., when requesting workers not to use LLMs, summaries contained fewer keywords carrying essential information. Our estimates will likely change as LLMs increase in popularity or capabilities, and as norms around their usage change. Yet, understanding the co-evolution of LLM-based tools and users is key to maintaining the validity of research done using crowdsourcing, and we provide a critical baseline before widespread adoption ensues.