Data generation-based zero-shot learning trains Small Task-specific Models (STMs) on synthetic datasets generated by Pre-trained Language Models (PLMs), but its effectiveness is often limited by the low quality of those synthetic datasets. Previous solutions have focused primarily on single-PLM settings, where the synthetic data is typically confined to specific sub-spaces and deviates from real-world distributions, leading to severe distribution bias. To mitigate this bias, we propose FuseGen, a novel data generation-based zero-shot learning framework that introduces a new criterion for selecting subsets from synthetic datasets by utilizing multiple PLMs and the trained STMs. The chosen subset provides in-context feedback to each PLM, enhancing dataset quality through iterative data generation. The trained STMs are also used for sample re-weighting, further improving data quality. Extensive experiments across diverse tasks demonstrate that FuseGen substantially outperforms existing methods and is highly effective at boosting STM performance in a PLM-agnostic way. Code is available at https://github.com/LindaLydia/FuseGen.
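The iterative loop described above (multi-PLM generation, STM-guided subset selection, and feeding the selected subset back as in-context examples) can be sketched in miniature. This is a hedged illustration, not FuseGen's actual implementation: `plm_generate` and `stm_score` are hypothetical stand-ins for real PLM generation and STM-based quality scoring, and the selection criterion is reduced to a simple top-k ranking.

```python
import random

random.seed(0)  # for reproducibility of this toy sketch

def plm_generate(plm_id, feedback, n=20):
    """Stand-in for one PLM generating n synthetic (text, label) samples.
    In the real framework, the feedback subset would be placed in-context
    in the generation prompt to steer the PLM toward higher-quality data."""
    return [(f"plm{plm_id}-sample{i}", i % 2) for i in range(n)]

def stm_score(sample):
    """Stand-in for STM-based quality scoring of a synthetic sample
    (e.g. prediction confidence of the trained STMs); random here."""
    return random.random()

def fusegen_round(num_plms=3, subset_size=5, feedback=None):
    """One iteration: pool generations from multiple PLMs, rank them by
    STM score, and keep the top subset as the next round's feedback."""
    pool = []
    for pid in range(num_plms):
        pool.extend(plm_generate(pid, feedback))
    # Subset selection / re-weighting proxy: keep highest-scoring samples.
    ranked = sorted(pool, key=stm_score, reverse=True)
    return ranked[:subset_size]

feedback = None
for _ in range(2):  # iterative data generation across rounds
    feedback = fusegen_round(feedback=feedback)
print(len(feedback))  # size of the selected cross-PLM feedback subset
```

The key design point the sketch captures is that selection draws from a *pooled* multi-PLM dataset, so the feedback subset is not confined to any single PLM's generation sub-space.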