With contrastive pre-training, sentence encoders are generally optimized to place semantically similar samples close to each other in their embedding spaces. In this work, we focus on the potential of these embedding spaces to be readily adapted to zero-shot text classification, since semantically distinct samples are already well separated. Our framework, RaLP (Retrieval augmented Label Prompts for sentence encoder), encodes prompted label candidates with a sentence encoder, then assigns the label whose prompt embedding has the highest similarity to the input text embedding. To compensate for labels that may be poorly descriptive in their original format, RaLP retrieves sentences that are semantically similar to the original label prompt from external corpora and uses them as additional pseudo-label prompts. RaLP achieves competitive or stronger performance than much larger baselines on various closed-set classification and multiple-choice QA datasets under zero-shot settings. We show that the retrieval component plays a pivotal role in RaLP's success, and that its results are robust to verbalizer variations.
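The pipeline described above (prompt each label, retrieve similar sentences as pseudo-label prompts, assign the label with the highest similarity to the input) can be illustrated with a minimal sketch. It is not the authors' implementation: `embed` is a toy bag-of-words stand-in for a real sentence encoder, the prompt template, the corpus, and the function names are hypothetical, and scoring a label by the maximum similarity over its prompts is an assumption, since the abstract does not state how prompt scores are aggregated.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder: L2-normalised word counts."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def cosine(u, v):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


def retrieve(query, corpus, k):
    """Return the k corpus sentences most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(corpus, key=lambda s: cosine(query_vec, embed(s)), reverse=True)
    return ranked[:k]


def classify(text, labels, corpus, template="This text is about {}.", k=2):
    """Score each label by its best-matching prompt and return the top label."""
    text_vec = embed(text)
    scores = {}
    for label in labels:
        prompt = template.format(label)
        # Original label prompt plus retrieved pseudo-label prompts.
        prompts = [prompt] + retrieve(prompt, corpus, k)
        # Assumption: aggregate over prompts with max.
        scores[label] = max(cosine(text_vec, embed(p)) for p in prompts)
    return max(scores, key=scores.get), scores


# Hypothetical external corpus and label set, for illustration only.
corpus = [
    "The sports team won the football match last night.",
    "Sports fans cheered as the striker scored a goal.",
    "The technology company released a new smartphone chip.",
    "New technology makes laptop processors faster.",
]
labels = ["sports", "technology"]
text = "The striker scored a late goal in the football match."

prediction, scores = classify(text, labels, corpus)
print(prediction, scores)
```

In this toy setup the bare label prompts share no words with the input, so only the retrieved sentences separate the two labels, which mirrors the role the abstract attributes to the retrieval component. With a real sentence encoder, `embed` would return dense vectors and `retrieve` would typically use a nearest-neighbour index over the external corpus.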