CELDA: Leveraging Black-box Language Model as Enhanced Classifier without Labels

Utilizing language models (LMs) without internal access is becoming an attractive paradigm in NLP, as many cutting-edge LMs are released through APIs and boast a massive scale. The de facto method in this black-box scenario is prompting, which has shown progressive performance gains when data labels are scarce or unavailable. Despite its efficacy, prompting still falls short of fully supervised counterparts and is generally brittle to slight modifications. In this paper, we propose Clustering-enhanced Linear Discriminative Analysis (CELDA), a novel approach that improves text classification accuracy with a very weak supervision signal (i.e., the names of the labels). Our framework draws a precise decision boundary without accessing the weights or gradients of the LM or any data labels. The core ideas of CELDA are twofold: (1) extracting a refined pseudo-labeled dataset from an unlabeled dataset, and (2) training a lightweight and robust model on top of the LM, which learns an accurate decision boundary from the extracted noisy dataset. Through in-depth investigations on various datasets, we demonstrate that CELDA reaches a new state of the art in weakly supervised text classification and narrows the gap with a fully supervised model. Additionally, the proposed methodology can be applied to any LM and has the potential to scale to larger models, making it a more viable option for utilizing large LMs.
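
The two core ideas can be illustrated with a small toy sketch. This is not the authors' implementation, and several pieces are assumptions made for illustration: synthetic Gaussian vectors stand in for the features a black-box LM would return, randomly flipped labels stand in for the noisy predictions of zero-shot prompting, and a simple k-means majority vote stands in for the paper's clustering-based refinement. All function names (`kmeans`, `refine_pseudo_labels`, `fit_lda`, `predict_lda`, `demo`) are hypothetical.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster index per sample."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = x[assign == j].mean(axis=0)
    return assign


def refine_pseudo_labels(x, pseudo, n_classes, n_clusters):
    """Step (1), simplified: keep samples whose pseudo-label matches
    the majority pseudo-label of their cluster."""
    assign = kmeans(x, n_clusters)
    keep = np.zeros(len(x), dtype=bool)
    for j in range(n_clusters):
        members = assign == j
        if not members.any():
            continue
        majority = np.bincount(pseudo[members], minlength=n_classes).argmax()
        keep |= members & (pseudo == majority)
    return keep


def fit_lda(x, y, n_classes, reg=1e-3):
    """Step (2): linear discriminant analysis with a shared covariance."""
    means = np.stack([x[y == c].mean(axis=0) for c in range(n_classes)])
    centered = x - means[y]
    cov = centered.T @ centered / len(x) + reg * np.eye(x.shape[1])
    priors = np.bincount(y, minlength=n_classes) / len(y)
    w = np.linalg.solve(cov, means.T).T  # shape: (n_classes, n_features)
    b = -0.5 * (means * w).sum(axis=1) + np.log(priors)
    return w, b


def predict_lda(x, w, b):
    """Pick the class with the highest linear discriminant score."""
    return (x @ w.T + b).argmax(axis=1)


def demo(seed=0):
    rng = np.random.default_rng(seed)
    n, d = 300, 8
    true = rng.integers(0, 2, size=n)
    class_centers = np.stack([np.full(d, -2.0), np.full(d, 2.0)])
    # Assumption: synthetic features replace real LM representations.
    x = class_centers[true] + rng.normal(size=(n, d))
    # Assumption: 20% label flips simulate noisy prompting predictions.
    flip = rng.random(n) < 0.2
    pseudo = np.where(flip, 1 - true, true)

    keep = refine_pseudo_labels(x, pseudo, n_classes=2, n_clusters=2)
    w, b = fit_lda(x[keep], pseudo[keep], n_classes=2)
    pred = predict_lda(x, w, b)
    return (pseudo == true).mean(), (pred == true).mean()


if __name__ == "__main__":
    noisy_acc, lda_acc = demo()
    print(f"pseudo-label accuracy: {noisy_acc:.3f}")
    print(f"LDA accuracy after refinement: {lda_acc:.3f}")
```

No true labels are used for training here; they appear only in `demo` to generate the data and to score the result. The linear model sits entirely on top of fixed feature vectors, which is the property that would let the same recipe run against an LM exposed only through an API.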
