There is rising interest in further exploring the zero-shot learning potential of large pre-trained language models (PLMs). A new paradigm called data-generation-based zero-shot learning has achieved impressive success. In this paradigm, the data synthesized by the PLM acts as the carrier of knowledge and is used to train a task-specific model with orders of magnitude fewer parameters than the PLM, achieving both higher performance and greater efficiency than prompt-based zero-shot learning methods on PLMs. The main hurdle of this approach is that the data synthesized by the PLM usually contains a significant portion of low-quality samples. Fitting on such data greatly hampers the performance of the task-specific model, making it unreliable for deployment. Previous methods remedy this issue mainly by filtering synthetic data using heuristic metrics (e.g., output confidence), or by refining the data with the help of a human expert, which requires excessive manual tuning or incurs high costs. In this paper, we propose a novel noise-robust re-weighting framework, SunGen, to automatically construct high-quality data for zero-shot classification problems. Our framework learns sample weights that indicate data quality without requiring any human annotation. We theoretically and empirically verify the ability of our method to help construct good-quality synthetic datasets. Notably, SunGen-LSTM yields a 9.8% relative improvement over the baseline in average accuracy across eight different established text classification tasks.
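To make the role of per-sample weights concrete, the following is a minimal sketch of a weighted cross-entropy loss in plain Python. It is not SunGen's algorithm: the abstract does not describe how the weights are learned, so the weights below are hand-set, hypothetical values, and the function name `weighted_cross_entropy` and the toy batch are assumptions made purely for illustration. The sketch only shows how a low weight reduces the influence of a suspect synthetic sample on the training loss.

```python
import math


def weighted_cross_entropy(probs, labels, weights, eps=1e-12):
    """Weighted average of per-sample cross-entropy losses.

    probs:   list of per-class probability lists, one per sample
    labels:  list of integer class labels (possibly noisy synthetic labels)
    weights: list of non-negative per-sample weights (hand-set here;
             a re-weighting method would learn them instead)
    """
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("at least one sample weight must be positive")
    loss = 0.0
    for p, y, w in zip(probs, labels, weights):
        # Clamp the probability to avoid log(0) on degenerate predictions.
        loss += w * -math.log(max(p[y], eps))
    return loss / total_weight


# Toy batch: the third sample's label disagrees strongly with the prediction,
# standing in for a low-quality synthetic example.
probs = [[0.9, 0.1], [0.2, 0.8], [0.95, 0.05]]
labels = [0, 1, 1]

uniform = weighted_cross_entropy(probs, labels, [1.0, 1.0, 1.0])
reweighted = weighted_cross_entropy(probs, labels, [1.0, 1.0, 0.0])
print(uniform, reweighted)
```

With uniform weights the suspect sample dominates the loss; setting its weight to zero removes its contribution entirely, which is the effect a learned quality weight is meant to approximate in a softer form.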