A universal classification model aims to generalize to diverse classification tasks in both zero-shot and few-shot settings. A promising route toward universal classification is to cast heterogeneous data formats into a dataset-agnostic "meta-task" (e.g., textual entailment, question answering) and then pretrain a model on the combined meta-dataset. Existing models are either pretrained on specific subsets of classification tasks, or pretrained on both classification and generation data, in which case the model does not fulfill its potential for universality and reliability. Both approaches also leave a large amount of annotated data under-exploited. To fill these gaps, we propose ConEntail, a new framework for universal zero-shot and few-shot classification with supervised contrastive pretraining. Our unified meta-task for classification is based on nested entailment, which can be interpreted as "Does sentence a entail [sentence b entails label c]?". This formulation enables us to make better use of 57 annotated classification datasets for supervised contrastive pretraining and universal evaluation. In this way, ConEntail helps the model (1) absorb knowledge from different datasets and (2) achieve consistent performance gains with more pretraining data. In experiments, we compare our model with discriminative and generative models pretrained on the same dataset. The results confirm that our framework effectively exploits existing annotated data and consistently outperforms baselines in both zero-shot (9.4% average improvement) and few-shot (3.5% average improvement) settings.
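To make the nested entailment meta-task concrete, the following minimal sketch shows one way labeled examples from different datasets could be cast into "sentence a entails [sentence b entails label c]" pairs. It is an illustration under stated assumptions, not the ConEntail implementation: the names `Example`, `NestedPair`, `nested_hypothesis`, and `build_pairs`, the hypothesis template, the rule that a pair is positive when sentence a shares label c, and the toy data are all hypothetical and do not come from the abstract.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Example:
    text: str
    label: str


@dataclass(frozen=True)
class NestedPair:
    premise: str     # sentence a
    hypothesis: str  # "[sentence b entails label c]"
    target: int      # 1 if sentence a also carries label c, else 0 (assumed rule)


def nested_hypothesis(support: Example) -> str:
    # Hypothetical template; the abstract does not specify the wording.
    return f'"{support.text}" entails "{support.label}"'


def build_pairs(queries, supports):
    """Pair every query (sentence a) with every support (sentence b, label c)."""
    pairs = []
    for query, support in product(queries, supports):
        if query.text == support.text:
            continue  # do not pair a sentence with itself
        target = int(query.label == support.label)
        pairs.append(NestedPair(query.text, nested_hypothesis(support), target))
    return pairs


# Toy data standing in for two different datasets (made up for illustration).
sentiment = [
    Example("I loved this film.", "positive"),
    Example("A dull, lifeless plot.", "negative"),
    Example("Great acting throughout.", "positive"),
]
topic = [
    Example("The striker scored twice.", "sports"),
    Example("Shares fell sharply today.", "business"),
]

# Each dataset is converted separately, then merged into one meta-dataset.
pairs = build_pairs(sentiment, sentiment) + build_pairs(topic, topic)

for pair in pairs[:3]:
    print(pair.target, "|", pair.premise, "=>", pair.hypothesis)
```

Because every dataset is reduced to the same pair format, the merged list can in principle feed a single pretraining objective regardless of each dataset's original label set. In this sketch, positive and negative pairs would play the role of the same-class and different-class examples that a supervised contrastive loss pulls together and pushes apart.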