A major challenge in Natural Language Processing is obtaining annotated data for supervised learning. One option is to use crowdsourcing platforms for data annotation. However, crowdsourcing introduces issues related to annotators' experience, consistency, and biases. An alternative is to use zero-shot methods, which in turn have limitations compared to their few-shot or fully supervised counterparts. Recent advances driven by large language models show potential, but these models struggle to adapt to specialized domains with severely limited data. The most common approaches therefore involve humans randomly annotating a set of data points to build an initial dataset. However, randomly sampling data for annotation is often inefficient because it ignores the characteristics of the data and the specific needs of the model. The situation worsens with imbalanced datasets, as random sampling is heavily biased towards the majority classes, leading to an excess of annotated data. To address these issues, this paper contributes an automatic and informed data selection architecture for building a small dataset for few-shot learning. Our proposal minimizes the quantity and maximizes the diversity of the data selected for human annotation, while improving model performance.