Weakly supervised text classification (WSTC), also called zero-shot or dataless text classification, has attracted increasing attention because it can classify large volumes of text in the dynamic and open Web environment while requiring only a limited set of seed words (label names) for each category instead of labeled data. With the help of the recently popular prompting of Pre-trained Language Models (PLMs), many studies have leveraged manually crafted and/or automatically identified verbalizers to estimate the likelihood of each category, but they fail to differentiate the effects of these category-indicative words, let alone capture their correlations or adapt them to the unlabeled corpus. In this paper, to let the PLM effectively understand each category, we first propose a novel form of rule-based knowledge that uses logical expressions to characterize the meanings of categories. We then develop a prompting PLM-based approach named RulePrompt for the WSTC task, consisting of a rule mining module and a rule-enhanced pseudo label generation module, plus a self-supervised fine-tuning module that aligns the PLM with this task. Within this framework, the inaccurate pseudo labels assigned to texts and the imprecise logical rules associated with categories mutually enhance each other in an alternating manner. This establishes a self-iterative closed loop of knowledge (rule) acquisition and utilization, with seed words serving as the starting point. Extensive experiments validate the effectiveness and robustness of our approach, which markedly outperforms state-of-the-art weakly supervised methods. Moreover, our approach yields interpretable category rules, demonstrating its advantage in disambiguating easily confused categories.
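The alternating loop between pseudo labels and category rules described in the abstract can be illustrated with a toy sketch. This is not the RulePrompt implementation: it assumes that a rule is a disjunction of conjunctions over indicative words, that fuzzy logic (AND = min, OR = max) combines word scores, and that plain word occurrence stands in for the probabilities a prompted PLM would supply. The function names (`word_scores`, `rule_score`, `pseudo_label`, `mine_rules`, `run`) and the frequency-based mining heuristic are hypothetical, and the self-supervised fine-tuning step is omitted.

```python
from collections import Counter


def word_scores(text):
    # ASSUMPTION: stand-in for PLM prompt probabilities; 1.0 if the word occurs.
    return {w: 1.0 for w in text.lower().split()}


def rule_score(rule, scores):
    # A rule is an OR over clauses; a clause is an AND over words.
    # Fuzzy semantics assumed here: AND = min, OR = max.
    return max(min(scores.get(w, 0.0) for w in clause) for clause in rule)


def pseudo_label(texts, rules):
    # Assign each text the category whose rule scores highest (None if no rule fires).
    labels = []
    for text in texts:
        scores = word_scores(text)
        by_cat = {c: rule_score(r, scores) for c, r in rules.items()}
        best = max(by_cat, key=by_cat.get)
        labels.append(best if by_cat[best] > 0 else None)
    return labels


def mine_rules(texts, labels, seeds, top_k=2):
    # Hypothetical mining heuristic: extend each seed rule with the most
    # frequent words that occur only in texts pseudo-labeled with that category.
    counts = {c: Counter() for c in seeds}
    for text, label in zip(texts, labels):
        if label is not None:
            counts[label].update(set(text.lower().split()))
    rules = {}
    for c, seed in seeds.items():
        others = set()
        for other, cnt in counts.items():
            if other != c:
                others.update(cnt)
        candidates = [(w, n) for w, n in counts[c].items()
                      if w != seed and w not in others]
        candidates.sort(key=lambda wn: (-wn[1], wn[0]))  # deterministic order
        rules[c] = [[seed]] + [[w] for w, _ in candidates[:top_k]]
    return rules


def run(texts, seeds, rounds=2):
    # Seed words are the starting rules; labels and rules then refine each other.
    rules = {c: [[s]] for c, s in seeds.items()}
    labels = pseudo_label(texts, rules)
    for _ in range(rounds):
        rules = mine_rules(texts, labels, seeds)
        labels = pseudo_label(texts, rules)
    return rules, labels


texts = [
    "sports team wins",
    "team scores again",
    "politics election today",
    "election results announced",
]
seeds = {"sports": "sports", "politics": "politics"}
rules, labels = run(texts, seeds)
print(labels)
print(rules)
```

In this toy corpus, the second and fourth texts contain no seed word, so they receive a label only after the first round of rule mining adds words such as "team" and "election" to the category rules. That is the kind of mutual enhancement the abstract refers to, here in a heavily simplified form.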