It is often desirable to distill the capabilities of large language models (LLMs) into smaller student models due to compute and memory constraints. One way to do this for classification tasks is via dataset synthesis, which can be accomplished by generating examples of each label from the LLM. Prior approaches to synthesis use few-shot prompting, which relies on the LLM's parametric knowledge to generate usable examples. However, this leads to issues of repetition, bias towards popular entities, and stylistic differences from human text. In this work, we propose Synthesize by Retrieval and Refinement (SynthesizRR), which uses retrieval augmentation to introduce variety into the dataset synthesis process: as retrieved passages vary, the LLM is seeded with different content to generate its examples. We empirically study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor detection, which require complex synthesis strategies. We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance, compared to 32-shot prompting and four prior approaches. We release code to reproduce all steps at https://github.com/amazon-science/synthesizrr