Multi-Label Text Classification (MLTC) is a practical yet challenging task that involves assigning multiple non-exclusive labels to each document. Previous studies primarily focus on capturing label correlations to assist label prediction by introducing special labeling schemes, designing specific model structures, or adding auxiliary tasks. Recently, the $k$ Nearest Neighbor ($k$NN) framework has shown promise by retrieving labeled samples as references to mine label co-occurrence information in the embedding space. However, two critical biases, namely embedding alignment bias and confidence estimation bias, are often overlooked, adversely affecting prediction performance. In this paper, we introduce a DEbiased Nearest Neighbors (DENN) framework for MLTC, specifically designed to mitigate these biases. To address embedding alignment bias, we propose a debiased contrastive learning strategy that enhances neighbor consistency on label co-occurrence. For confidence estimation bias, we present a debiased confidence estimation strategy that improves the adaptive combination of predictions from $k$NN and inductive binary classifications. Extensive experiments on four public benchmark datasets (i.e., AAPD, RCV1-V2, Amazon-531, and EUR-LEX57K) demonstrate the effectiveness of our method. Moreover, our method introduces no extra parameters.
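To make the $k$NN framework referenced above concrete, the following is a minimal sketch of how retrieved neighbors can yield multi-label scores that are then combined with a classifier's predictions. This is an illustrative reconstruction, not the paper's exact DENN method: the cosine-similarity weighting, the `k` value, and the fixed interpolation weight `lam` are all assumptions (DENN's contribution is precisely to make this combination weight adaptive and debiased).

```python
import numpy as np

def knn_label_scores(query_emb, train_embs, train_labels, k=2):
    """Retrieve the k nearest labeled samples (cosine similarity) and
    aggregate their multi-hot label vectors, weighted by similarity."""
    # Normalize embeddings so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = t @ q
    top = np.argsort(-sims)[:k]          # indices of the k nearest neighbors
    w = np.exp(sims[top])                # softmax-style neighbor weights
    w /= w.sum()
    return w @ train_labels[top]         # per-label score in [0, 1]

def combine(p_cls, p_knn, lam=0.5):
    """Convex combination of inductive classifier probabilities and kNN
    label scores. lam is fixed here; an adaptive, debiased estimate of
    this confidence weight is what a method like DENN would supply."""
    return lam * p_knn + (1.0 - lam) * p_cls
```

Usage: given a query embedding, `knn_label_scores` produces transductive evidence from the labeled neighborhood, and `combine` interpolates it with the binary classifiers' sigmoid outputs before thresholding each label independently.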