Using language models (LMs) without access to their internals is becoming an attractive paradigm in NLP, as many cutting-edge LMs are released through APIs and are massive in scale. The de facto method in this black-box scenario is prompting, which has shown steady performance gains when data labels are scarce or unavailable. Despite its efficacy, prompting still falls short of fully supervised counterparts and is generally brittle to slight modifications. In this paper, we propose Clustering-enhanced Linear Discriminative Analysis (CELDA), a novel approach that improves text classification accuracy with a very weak supervision signal (i.e., the names of the labels). Our framework draws a precise decision boundary without accessing the weights or gradients of the LM, or any data labels. The core ideas of CELDA are twofold: (1) extracting a refined pseudo-labeled dataset from an unlabeled dataset, and (2) training a lightweight and robust model on top of the LM, which learns an accurate decision boundary from the extracted noisy dataset. Through in-depth investigations on various datasets, we demonstrate that CELDA reaches a new state of the art in weakly supervised text classification and narrows the gap with a fully supervised model. Additionally, our proposed methodology can be applied universally to any LM and has the potential to scale to larger models, making it a more viable option for utilizing large LMs.
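The two ideas named in the abstract (refining noisy pseudo-labels through clustering, then fitting a lightweight linear discriminant on frozen LM features) can be illustrated with a minimal sketch. This is not the paper's implementation. Everything below is an assumption made for illustration: the features are synthetic Gaussian blobs standing in for LM representations, the initial pseudo-labels are simulated by flipping 30% of the true labels rather than by querying an LM with label names, k-means is seeded with the noisy class means, and the helper names `refine_with_clustering`, `fit_lda`, and `predict_lda` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for frozen LM features: two classes, 4-dim Gaussian blobs.
n_per_class, dim = 200, 4
X = np.vstack([
    rng.normal(loc=-2.0, scale=1.0, size=(n_per_class, dim)),
    rng.normal(loc=+2.0, scale=1.0, size=(n_per_class, dim)),
])
y_true = np.repeat([0, 1], n_per_class)  # used only for evaluation

# Stand-in for label-name (zero-shot) predictions: flip 30% of the labels.
y_noisy = y_true.copy()
flip = rng.choice(len(y_true), size=int(0.3 * len(y_true)), replace=False)
y_noisy[flip] = 1 - y_noisy[flip]


def refine_with_clustering(X, y_init, n_classes, keep_ratio=0.5, n_iter=20):
    """K-means seeded with the noisy class means; keep points nearest each centroid.

    Assumes no cluster becomes empty, which holds for this toy data.
    """
    centroids = np.stack([X[y_init == k].mean(axis=0) for k in range(n_classes)])
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        centroids = np.stack([X[assign == k].mean(axis=0) for k in range(n_classes)])
    # Recompute distances and assignments against the final centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    keep = np.zeros(len(X), dtype=bool)
    for k in range(n_classes):
        idx = np.flatnonzero(assign == k)
        n_keep = max(1, int(keep_ratio * len(idx)))
        keep[idx[np.argsort(dists[idx, k])[:n_keep]]] = True
    return keep, assign


def fit_lda(X, y, n_classes):
    """Closed-form LDA: per-class means plus one pooled covariance matrix."""
    means = np.stack([X[y == k].mean(axis=0) for k in range(n_classes)])
    priors = np.array([(y == k).mean() for k in range(n_classes)])
    centered = X - means[y]
    cov = centered.T @ centered / (len(X) - n_classes)
    cov += 1e-6 * np.eye(X.shape[1])  # small ridge for numerical stability
    W = np.linalg.solve(cov, means.T)  # column k is cov^{-1} @ mean_k
    b = -0.5 * np.sum(means.T * W, axis=0) + np.log(priors)
    return W, b


def predict_lda(X, W, b):
    """Pick the class with the largest linear discriminant score."""
    return (X @ W + b).argmax(axis=1)


keep, y_cluster = refine_with_clustering(X, y_noisy, n_classes=2)
W, b = fit_lda(X[keep], y_cluster[keep], n_classes=2)
y_pred = predict_lda(X, W, b)

noisy_acc = (y_noisy == y_true).mean()
refined_acc = (y_cluster[keep] == y_true[keep]).mean()
lda_acc = (y_pred == y_true).mean()
print(f"noisy={noisy_acc:.2f} refined={refined_acc:.2f} lda={lda_acc:.2f}")
```

On this deliberately easy toy data, the retained pseudo-labels and the final classifier are far more accurate than the simulated noisy labels. Real LM features and real prompt-based predictions would be much less cleanly separated, so the sketch only shows the shape of the pipeline, not the performance to expect.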