Few-shot text classification systems have impressive capabilities but are infeasible to deploy and use reliably because they depend on prompting and billion-parameter language models. SetFit (Tunstall et al., 2022) is a recent, practical approach that fine-tunes a Sentence Transformer under a contrastive learning paradigm and achieves results similar to those of more unwieldy systems. Inexpensive text classification is important for addressing domain drift in all classification tasks, and especially in detecting harmful content, which plagues social media platforms. Here, we propose Like a Good Nearest Neighbor (LaGoNN), a modification to SetFit that introduces no learnable parameters but alters the input text with information from its nearest neighbor in the training data (for example, the neighbor's label and text), making novel data appear similar to an instance on which the model was optimized. LaGoNN is effective at flagging undesirable content and at text classification more generally, and it improves the performance of SetFit. To demonstrate the value of LaGoNN, we conduct a thorough study of text classification systems in the context of content moderation under four label distributions, and in general and multilingual classification settings.
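The input modification described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for Sentence Transformer embeddings, and the function names, the `[SEP]` separator, and the order in which the neighbor's label and text are appended are all assumptions made for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (assumed stand-in for a Sentence Transformer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbor(text, train_texts):
    """Return the index of the training text most similar to `text`."""
    query = embed(text)
    sims = [cosine(query, embed(t)) for t in train_texts]
    return max(range(len(train_texts)), key=sims.__getitem__)


def augment(text, train_texts, train_labels, sep=" [SEP] "):
    """Append the nearest neighbor's label and text to the input.

    The output format is hypothetical; no learnable parameters are involved.
    """
    i = nearest_neighbor(text, train_texts)
    return f"{text}{sep}{train_labels[i]}{sep}{train_texts[i]}"


# Tiny made-up training set for demonstration only.
train_texts = ["you are a wonderful person", "you are a stupid idiot"]
train_labels = ["neutral", "harmful"]

augmented = augment("what a stupid idiot", train_texts, train_labels)
print(augmented)
# what a stupid idiot [SEP] harmful [SEP] you are a stupid idiot
```

In this sketch the augmented string, rather than the raw input, would be passed to the classifier, so a novel input carries the label and text of a training instance the model has already seen.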