With the development of large language models (LLMs), zero-shot learning has attracted much attention for various NLP tasks. Unlike prior work that generates training data with billion-scale natural language generation (NLG) models, we propose REGEN, a retrieval-enhanced framework that creates training data from a general-domain unlabeled corpus. To realize this, we first conduct contrastive pretraining to learn an unsupervised dense retriever, which extracts the most relevant documents using class-descriptive verbalizers as queries. We then propose two simple strategies, namely Verbalizer Augmentation with Demonstrations and Self-consistency Guided Filtering, to improve the topic coverage of the dataset while removing noisy examples. Experiments on nine datasets demonstrate that REGEN achieves a 4.3% gain over the strongest baselines and saves around 70% of the time compared to baselines using large NLG models. In addition, REGEN can be naturally integrated with recently proposed large language models to boost performance.
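To make the retrieve-then-filter idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a toy bag-of-words `embed` function as a stand-in for the contrastively pretrained dense retriever, and it uses one simple, assumed notion of consistency (keep a document only if its assigned class is also the class whose verbalizer it matches best); the abstract does not specify the actual filtering criterion. All function names, the corpus, and the verbalizers are hypothetical, and Verbalizer Augmentation with Demonstrations is not modeled.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a dense encoder (assumption): L2-normalised bag of words."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {w: v / norm for w, v in counts.items()} if norm else {}


def similarity(a, b):
    """Dot product of two sparse vectors (cosine similarity, since both are normalised)."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())


def retrieve(verbalizers, corpus, top_k=2):
    """For each class, pseudo-label the top_k documents closest to its verbalizer."""
    doc_vecs = [embed(d) for d in corpus]
    pairs = []
    for label, query in verbalizers.items():
        q = embed(query)
        ranked = sorted(range(len(corpus)),
                        key=lambda i: similarity(q, doc_vecs[i]),
                        reverse=True)
        for i in ranked[:top_k]:
            if similarity(q, doc_vecs[i]) > 0:  # skip documents with no overlap at all
                pairs.append((corpus[i], label))
    return pairs


def consistency_filter(pairs, verbalizers):
    """Assumed filter: keep a pair only if its label is the best-matching class."""
    queries = {label: embed(q) for label, q in verbalizers.items()}
    kept = []
    for doc, label in pairs:
        d = embed(doc)
        best = max(queries, key=lambda c: similarity(queries[c], d))
        if best == label:
            kept.append((doc, label))
    return kept


# Hypothetical unlabeled corpus and class-descriptive verbalizers.
corpus = [
    "the team won the football match",
    "the striker scored in the football final",
    "stocks fell as the market closed",
    "the market rallied on strong earnings",
    "football club shares fell on the stock market",
]
verbalizers = {"sports": "football match", "business": "stock market"}

pairs = retrieve(verbalizers, corpus, top_k=2)
clean = consistency_filter(pairs, verbalizers)
for doc, label in clean:
    print(label, "|", doc)
```

In this toy run, the document about football club shares is retrieved for both classes, and the filter keeps it only under the class it matches more strongly. A real system would replace `embed` with a learned encoder and would train a classifier on the filtered pairs.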