Text classification is essential for organizing unstructured text. Traditional methods rely on human annotations or, more recently, a set of class seed words for supervision, both of which can be costly to obtain, particularly for specialized or emerging domains. To address this, recent work has proposed using class surface names alone as extremely weak supervision. However, existing approaches treat different levels of text granularity (documents, sentences, or words) independently, disregarding inter-granularity class disagreements and the context identifiable exclusively through joint extraction. To tackle these issues, we introduce MEGClass, an extremely weakly supervised text classification method that leverages Mutually-Enhancing Text Granularities. MEGClass utilizes coarse- and fine-grained context signals obtained by jointly considering a document's most class-indicative words and sentences. This approach enables the learning of a contextualized document representation that captures the most discriminative class indicators. By preserving the heterogeneity of potential classes, MEGClass can select the most informative class-indicative documents as iterative feedback to enhance the initial word-based class representations and ultimately fine-tune a pre-trained text classifier. Extensive experiments on seven benchmark datasets demonstrate that MEGClass outperforms other weakly and extremely weakly supervised methods.
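The flow outlined in the abstract (word-based class representations, sentence-weighted document representations, and selection of the most class-indicative documents as feedback) can be illustrated with a small toy sketch. This is not the authors' implementation: the two-dimensional vectors stand in for real contextual embeddings, and the function names, the max-probability sentence weighting, the softmax temperature, and the top-k selection rule are all assumptions made only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def softmax(xs, temperature=0.1):
    """Numerically stable softmax; the temperature value is an assumption."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def class_distribution(vec, class_vecs):
    """Soft class assignment from similarity to each class representation."""
    return softmax([cosine(vec, c) for c in class_vecs])


def document_representation(sentence_vecs, class_vecs):
    """Weighted average of sentence vectors.

    Hypothetical weighting: a sentence counts more when its class
    distribution is peaked, i.e. when it is more class-indicative.
    """
    weights = [max(class_distribution(s, class_vecs)) for s in sentence_vecs]
    total = sum(weights)
    dim = len(sentence_vecs[0])
    return [
        sum(w * s[i] for w, s in zip(weights, sentence_vecs)) / total
        for i in range(dim)
    ]


def select_confident(docs, class_vecs, k=1):
    """Return {class_index: [doc_id, ...]} with the k most confident docs per class."""
    scored = {}
    for doc_id, sentences in docs.items():
        rep = document_representation(sentences, class_vecs)
        dist = class_distribution(rep, class_vecs)
        label = max(range(len(dist)), key=dist.__getitem__)
        scored.setdefault(label, []).append((dist[label], doc_id))
    return {
        label: [doc_id for _, doc_id in sorted(items, reverse=True)[:k]]
        for label, items in scored.items()
    }


# Toy data: class 0 and class 1 are represented by made-up unit vectors,
# and each document is a list of made-up sentence vectors.
class_vecs = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "d1": [[0.9, 0.1], [0.5, 0.5]],
    "d2": [[0.1, 0.9], [0.2, 0.8]],
    "d3": [[0.6, 0.4], [0.5, 0.5]],
}
selected = select_confident(docs, class_vecs, k=1)
print(selected)
```

In an iterative setup, the selected documents would then be used to refine the class representations before the next round, and the final confident set would serve as pseudo-labeled data for fine-tuning a classifier. Both steps are omitted here because the abstract does not specify their details.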