One of the most reliable ways to create deployable models for specialized tasks is to obtain an adequate amount of high-quality, task-specific data. For specialized tasks, however, such datasets often do not exist. Existing methods address this by generating data with large language models (LLMs) and then distilling that knowledge into smaller models. These methods are limited by the quality of the LLM's outputs, however, and tend to produce repetitive or incorrect data. In this work, we present Retrieval Based Distillation (ReBase), a method that first retrieves data from rich online sources and then transforms it into domain-specific data, greatly enhancing data diversity. Moreover, ReBase generates Chain-of-Thought reasoning and distills the reasoning capacity of LLMs. We test our method on 4 benchmarks, and the results show that it significantly improves performance: by up to 7.8% on SQuAD, 1.37% on MNLI, and 1.94% on BigBench-Hard.
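The retrieve-then-transform pipeline the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it uses a toy bag-of-words retriever in place of a neural encoder, and a stubbed template in place of the LLM call that rewrites a retrieved example into a task-specific instance with Chain-of-Thought reasoning. All function names and the output fields are hypothetical.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a neural encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, datastore: list[str], k: int = 2) -> list[str]:
    # Step 1: pull the k datastore entries most similar to the task description.
    q = embed(query)
    return sorted(datastore, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def transform(example: str, task: str) -> dict:
    # Step 2: stand-in for the LLM call that rewrites a retrieved example into
    # a task-specific training instance with a reasoning chain (hypothetical
    # schema; here just a template rather than a real generation).
    return {
        "input": example,
        "rationale": f"[LLM-generated reasoning toward the {task} answer]",
        "label": "[LLM-generated answer]",
    }

datastore = [
    "The Eiffel Tower is in Paris and was completed in 1889.",
    "Photosynthesis converts light energy into chemical energy.",
    "Paris is the capital of France.",
]
hits = retrieve("questions about Paris landmarks", datastore)
train_set = [transform(h, "question answering") for h in hits]
```

The resulting `train_set` is what a smaller model would then be fine-tuned on; because the inputs come from a broad datastore rather than being free-generated by the LLM, the examples are more diverse than pure LLM-synthesized data.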