Large language models (LLMs) are remarkable data annotators. They can be used to generate high-fidelity supervised training data, as well as survey and experimental data. With the widespread adoption of LLMs, human gold-standard annotations are key to understanding the capabilities of LLMs and the validity of their results. However, crowdsourcing, an important and inexpensive way to obtain human annotations, may itself be affected by LLMs, because crowd workers have financial incentives to use LLMs to increase their productivity and income. To investigate this concern, we conducted a case study on the prevalence of LLM usage by crowd workers. We reran an abstract summarization task from the literature on Amazon Mechanical Turk and, through a combination of keystroke detection and synthetic text classification, estimated that 33–46% of crowd workers used LLMs when completing the task. Although it is unclear whether this finding generalizes to other, less LLM-friendly tasks, our results call for platforms, researchers, and crowd workers to find new ways to ensure that human data remain human, perhaps using the methodology proposed here as a stepping stone. Code/data: https://github.com/epfl-dlab/GPTurk