Low-resource languages face significant barriers in AI development: the linguistic resources and annotation expertise needed for data labeling are limited, making labeled data rare and costly. Data scarcity and the absence of preexisting tools exacerbate these challenges, especially since these languages are often underrepresented in NLP datasets. To address this gap, we propose leveraging LLMs within an active learning loop for data annotation. We first evaluate inter-annotator agreement and consistency to select a suitable LLM annotator. The chosen annotator is then integrated into the training loop of a classifier under an active learning paradigm, minimizing the amount of queried data required. Empirical evaluations, notably employing GPT-4-Turbo, demonstrate near-state-of-the-art performance with significantly reduced data requirements, with estimated potential cost savings of at least 42.45 times compared to human annotation. Our proposed solution shows promising potential to substantially reduce both the monetary and computational costs of automation in low-resource settings. By bridging the gap between low-resource languages and AI, this approach fosters broader inclusion and shows the potential to enable automation across diverse linguistic landscapes.
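The abstract's core mechanism, an LLM standing in for human annotators inside a pool-based active learning loop with uncertainty sampling, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `llm_annotate` is a hypothetical stand-in for a real LLM API call (e.g., to GPT-4-Turbo), and the bag-of-words centroid classifier is a deliberately tiny student model chosen only to keep the example self-contained.

```python
import random
from collections import Counter

def llm_annotate(text: str) -> int:
    """Hypothetical LLM annotator (stand-in for an API call such as
    GPT-4-Turbo); here a keyword rule simulates its sentiment label."""
    return 1 if "good" in text or "great" in text else 0

class CentroidClassifier:
    """Tiny bag-of-words centroid classifier used as the student model."""
    def __init__(self):
        self.centroids = {}

    def _vec(self, text):
        return Counter(text.lower().split())

    def fit(self, texts, labels):
        sums = {}
        for t, y in zip(texts, labels):
            sums.setdefault(y, Counter()).update(self._vec(t))
        self.centroids = sums

    def predict_proba(self, text):
        v = self._vec(text)
        # Word-overlap score per class, smoothed so probabilities are defined.
        scores = {y: sum(v[w] * c[w] for w in v) + 1e-6
                  for y, c in self.centroids.items()}
        total = sum(scores.values())
        return {y: s / total for y, s in scores.items()}

def active_learning_loop(pool, budget, seed=0):
    """Query the LLM annotator for at most `budget` labels, always picking
    the pool example the current classifier is least confident about."""
    random.seed(seed)
    labeled, labels = [], []
    # Seed the loop with two random queries to the LLM annotator.
    for _ in range(2):
        x = pool.pop(random.randrange(len(pool)))
        labeled.append(x)
        labels.append(llm_annotate(x))
    clf = CentroidClassifier()
    clf.fit(labeled, labels)
    for _ in range(budget - 2):
        if not pool:
            break
        # Uncertainty sampling: query the example with the lowest
        # top-class probability under the current classifier.
        x = min(pool, key=lambda t: max(clf.predict_proba(t).values()))
        pool.remove(x)
        labeled.append(x)
        labels.append(llm_annotate(x))
        clf.fit(labeled, labels)
    return clf, labeled
```

The cost saving described in the abstract comes from the `budget` cap: only the examples the classifier is most uncertain about are ever sent to the (paid) annotator, rather than the whole pool.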