Recent advances in weakly supervised text classification mostly focus on designing sophisticated methods to turn high-level human heuristics into quality pseudo-labels. In this paper, we revisit the seed matching-based method, which is arguably the simplest way to generate pseudo-labels, and show that its power has been greatly underestimated. We show that the limited performance of seed matching is largely due to the label bias injected by the simple seed-match rule, which prevents the classifier from learning reliable confidence for selecting high-quality pseudo-labels. Interestingly, simply deleting the seed words present in the matched input texts can mitigate the label bias and help the classifier learn better confidence. As a result, the performance of seed matching improves significantly, making it on par with or even better than the state of the art. Furthermore, to handle the case where the seed words are not known, we propose simply deleting word tokens in the input text at random with a high deletion ratio. Remarkably, seed matching equipped with this random deletion method can often achieve even better performance than seed matching with seed deletion.
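The three operations named in the abstract (seed matching, seed deletion, and random deletion) can be illustrated with a minimal sketch. It is not the authors' implementation. The seed lists, the tokenizer, the tie-breaking rule, the choice to delete seed words of every class, and the default deletion ratio of 0.9 are all assumptions made for illustration.

```python
import random
import re

# Hypothetical seed words per class, chosen only for illustration.
SEED_WORDS = {
    "sports": {"soccer", "basketball"},
    "politics": {"election", "senate"},
}


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def seed_match(tokens, seed_words=SEED_WORDS):
    """Return the class whose seed words occur most often, or None if none occur.

    Ties go to the class listed first in `seed_words` (an assumption).
    """
    counts = {
        label: sum(token in seeds for token in tokens)
        for label, seeds in seed_words.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def delete_seeds(tokens, seed_words=SEED_WORDS):
    """Drop every token that is a seed word of any class (assumed variant)."""
    all_seeds = set().union(*seed_words.values())
    return [token for token in tokens if token not in all_seeds]


def random_delete(tokens, ratio=0.9, rng=None):
    """Drop each token independently with probability `ratio`.

    The default of 0.9 is an assumed stand-in for a "high deletion ratio".
    """
    if rng is None:
        rng = random.Random(0)
    return [token for token in tokens if rng.random() >= ratio]


if __name__ == "__main__":
    tokens = tokenize("The senate election was held after the soccer game")
    label = seed_match(tokens)           # pseudo-label from seed matching
    print(label, delete_seeds(tokens))   # classifier input with seeds removed
    print(label, random_delete(tokens))  # classifier input after random deletion
```

In both variants the pseudo-label still comes from seed matching on the original text. Only the text passed to the classifier is altered, so the classifier cannot rely on the seed words that produced the label.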