State-of-the-art supervised NLP models achieve high accuracy but remain susceptible to failures on inputs from low-data regimes, such as domains that are not represented in the training data. As an approximation to collecting ground-truth labels for a specific domain, we study the use of large language models (LLMs) for annotating inputs and improving the generalization of NLP models. Specifically, given a budget for LLM annotations, we present an algorithm for sampling the most informative inputs to annotate and then retraining the NLP model. We find that popular active learning strategies such as uncertainty-based sampling do not work well. Instead, we propose a sampling strategy based on the difference in prediction scores between the base model and the finetuned NLP model, exploiting the fact that most NLP models are finetuned from a base model. Experiments with classification (semantic similarity) and ranking (semantic search) tasks show that our sampling strategy leads to significant gains in accuracy for both the training and target domains.
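The sampling idea described in the abstract can be illustrated with a minimal Python sketch. The abstract states only that inputs are sampled according to the difference in prediction scores between the base model and the finetuned model, within an annotation budget. Everything else here is an assumption made for illustration: the function name `select_for_annotation`, the use of the absolute difference, and the choice of a simple top-k cut are hypothetical and may not match the algorithm in the paper.

```python
from typing import List, Sequence


def select_for_annotation(
    base_scores: Sequence[float],
    finetuned_scores: Sequence[float],
    budget: int,
) -> List[int]:
    """Return indices of the `budget` inputs with the largest score gap.

    Assumption: the "difference in prediction scores" is taken as the
    absolute gap between the two models' scores on the same input, and
    the inputs with the largest gaps are sent to the LLM for annotation.
    """
    if len(base_scores) != len(finetuned_scores):
        raise ValueError("both models must score the same inputs")
    if budget < 0:
        raise ValueError("budget must be non-negative")

    # Per-input disagreement between the base and the finetuned model.
    gaps = [abs(f - b) for b, f in zip(base_scores, finetuned_scores)]

    # Rank input indices by gap, largest first; ties keep input order
    # because Python's sort is stable.
    ranked = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    return ranked[:budget]


# Toy scores for four unlabeled target-domain inputs (made-up numbers).
base = [0.9, 0.2, 0.5, 0.7]
finetuned = [0.1, 0.25, 0.5, 0.3]
chosen = select_for_annotation(base, finetuned, budget=2)
```

In this toy example the first and fourth inputs have the largest gaps, so they would be the ones passed to the LLM for labels before retraining. Uncertainty-based sampling, by contrast, would rank inputs using the finetuned model's scores alone.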