Prevalent supervised learning methods in natural language processing (NLP) are notoriously data-hungry, demanding large amounts of high-quality annotated data. In practice, acquiring such data is costly. Recently, the strong few-shot performance of large language models (LLMs) has spurred work on dataset generation, where the training data are synthesized entirely by LLMs. However, such approaches usually suffer from low data quality and require orders of magnitude more labeled data to achieve satisfactory performance. To fully exploit the potential of LLMs and make use of massive unlabeled data, we propose LLMaAA, which takes LLMs as annotators and puts them into an active learning loop to decide efficiently what to annotate. To learn robustly from pseudo labels, we optimize both the annotation and training processes: (1) we draw k-NN examples from a small demonstration pool as in-context examples, and (2) we adopt an example reweighting technique that assigns learnable weights to training samples. Compared with previous approaches, LLMaAA is both efficient and reliable. We conduct experiments and analysis on two classic NLP tasks, named entity recognition and relation extraction. With LLMaAA, task-specific models trained on LLM-generated labels can outperform the teacher LLM with only hundreds of annotated examples, which is much more cost-effective than other baselines.
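As an illustration of step (1), the following is a minimal sketch of k-NN retrieval of in-context examples from a small demonstration pool. It is not the authors' implementation. The abstract does not say how sentences are embedded or how prompts are formatted, so the precomputed embedding vectors, the cosine similarity measure, the `Input:`/`Label:` prompt template, and the names `cosine`, `knn_demos`, and `build_prompt` are all assumptions made here for illustration.

```python
import math
from typing import List, Sequence, Tuple

# A demonstration is (text, label, embedding). Embeddings are assumed to be
# precomputed by some sentence encoder; the abstract does not specify which.
Demo = Tuple[str, str, Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_demos(query_vec: Sequence[float], pool: List[Demo], k: int) -> List[Demo]:
    """Return the k demonstrations most similar to the query, best first."""
    ranked = sorted(pool, key=lambda d: cosine(query_vec, d[2]), reverse=True)
    return ranked[:k]


def build_prompt(query_text: str, demos: List[Demo]) -> str:
    """Format retrieved demonstrations plus the query as a few-shot prompt.

    The template is hypothetical; a real annotator prompt would also carry
    task instructions and a task-specific label format.
    """
    blocks = [f"Input: {text}\nLabel: {label}" for text, label, _ in demos]
    blocks.append(f"Input: {query_text}\nLabel:")
    return "\n\n".join(blocks)
```

Retrieving demonstrations per query, rather than using one fixed prompt, means the LLM annotator sees examples that resemble the sentence being labeled. The sort over the full pool is kept for readability; it is cheap when the demonstration pool is small, as the abstract describes.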