Large Language Models (LLMs) have emerged as powerful support tools across a wide range of natural language tasks and application domains. Recent studies have focused on exploring their capabilities for data annotation. This paper provides a comparative overview of twelve studies investigating the potential of LLMs for labelling data. While the models demonstrate promising cost- and time-saving benefits, considerable limitations remain, such as representativeness, bias, sensitivity to prompt variations, and a preference for English. Leveraging insights from these studies, our empirical analysis further examines the alignment between human and GPT-generated opinion distributions across four subjective datasets. In contrast to studies that examine representativeness, our methodology obtains the opinion distribution directly from GPT. Our analysis thereby supports the minority of studies that consider diverse perspectives when evaluating data annotation tasks and highlights the need for further research in this direction.