Using language models (LMs) without internal access is becoming an attractive paradigm in NLP, since many cutting-edge LMs are released through APIs and are massive in scale. The de facto method in this black-box scenario is prompting, which has steadily improved performance in settings where data labels are scarce or unavailable. Despite its efficacy, prompting still falls short of fully supervised counterparts and is generally brittle to slight modifications. In this paper, we propose Clustering-enhanced Linear Discriminative Analysis (CELDA), a novel approach that improves text classification accuracy with a very weak supervision signal (i.e., the names of the labels). Our framework draws a precise decision boundary without access to the LM's weights or gradients, or to data labels. The core ideas of CELDA are twofold: (1) extracting a refined pseudo-labeled dataset from an unlabeled dataset, and (2) training a lightweight and robust model on top of the LM, which learns an accurate decision boundary from the extracted noisy dataset. Through in-depth investigations on various datasets, we demonstrate that CELDA reaches a new state of the art in weakly supervised text classification and narrows the gap with a fully supervised model. Additionally, our methodology can be applied to any LM and has the potential to scale to larger models, making it a more viable option for utilizing large LMs.
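The two core ideas can be illustrated with a minimal sketch. This is not the authors' implementation: the synthetic features stand in for representations returned by a black-box LM, the noisy labels stand in for pseudo-labels obtained by prompting with label names, and the purity-based cluster filter, the plain k-means routine, and all function names and thresholds are assumptions made for illustration only.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means (illustrative helper); returns cluster assignments."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = x[assign == j].mean(0)
    return assign

def filter_by_cluster(pseudo, assign, k, n_classes, min_purity=0.6):
    """Assumed refinement rule: keep clusters whose pseudo-labels mostly agree,
    and relabel their members with the cluster's majority pseudo-label."""
    keep = np.zeros(len(pseudo), dtype=bool)
    refined = pseudo.copy()
    for j in range(k):
        members = assign == j
        if not members.any():
            continue
        counts = np.bincount(pseudo[members], minlength=n_classes)
        if counts.max() / counts.sum() >= min_purity:
            keep |= members
            refined[members] = counts.argmax()
    return keep, refined

def fit_lda(x, y, n_classes, reg=1e-3):
    """Linear discriminant analysis with a shared covariance matrix.
    Assumes every class has at least one sample in (x, y)."""
    means = np.stack([x[y == c].mean(0) for c in range(n_classes)])
    centered = x - means[y]
    cov = centered.T @ centered / len(x) + reg * np.eye(x.shape[1])
    priors = np.bincount(y, minlength=n_classes) / len(y)
    w = np.linalg.solve(cov, means.T)                 # shape (d, n_classes)
    b = -0.5 * (means * w.T).sum(1) + np.log(priors)  # shape (n_classes,)
    return w, b

def run_demo(seed=0):
    rng = np.random.default_rng(seed)
    n, d = 400, 8
    y_true = rng.integers(0, 2, size=n)
    # Synthetic stand-in for LM features: two well-separated Gaussian blobs.
    feats = rng.normal(size=(n, d)) + np.where(y_true[:, None] == 1, 2.0, -2.0)
    # Synthetic stand-in for prompt-based pseudo-labels: about 20% are flipped.
    flip = rng.random(n) < 0.2
    pseudo = np.where(flip, 1 - y_true, y_true)

    assign = kmeans(feats, k=4, seed=seed)
    keep, refined = filter_by_cluster(pseudo, assign, k=4, n_classes=2)
    w, b = fit_lda(feats[keep], refined[keep], n_classes=2)
    pred = (feats @ w + b).argmax(1)
    return (pseudo == y_true).mean(), (pred == y_true).mean()

if __name__ == "__main__":
    noisy_acc, lda_acc = run_demo()
    print(f"pseudo-label accuracy: {noisy_acc:.3f}, LDA accuracy: {lda_acc:.3f}")
```

The true labels appear only to generate the toy data and to score the result; the clustering, filtering, and LDA steps see only features and pseudo-labels, which mirrors the label-free setting described above.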