eXtreme Multi-label text Classification (XMC) refers to training a classifier that assigns relevant labels to a text sample from an extremely large label set (e.g., millions of labels). We propose MatchXML, an efficient text-label matching framework for XMC. We observe that label embeddings generated from sparse Term Frequency-Inverse Document Frequency (TF-IDF) features have several limitations. We therefore propose label2vec, which trains semantic dense label embeddings with the Skip-gram model. The dense label embeddings are then clustered to build a Hierarchical Label Tree. When fine-tuning the pre-trained Transformer encoder, we formulate multi-label text classification as a text-label matching problem in a bipartite graph. We then extract dense text representations from the fine-tuned Transformer. In addition to these fine-tuned dense text embeddings, we extract static dense sentence embeddings from a pre-trained Sentence Transformer. Finally, a linear ranker is trained on the sparse TF-IDF features, the fine-tuned dense text representations, and the static dense sentence features. Experimental results demonstrate that MatchXML achieves state-of-the-art accuracy on five out of six datasets. In terms of speed, MatchXML outperforms the competing methods on all six datasets. Our source code is publicly available at https://github.com/huiyegit/MatchXML.
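As a rough illustration of the label2vec idea, the sketch below shows how Skip-gram training pairs might be built from the label sets of training samples. It is not the authors' implementation. It assumes, beyond what the abstract states, that the labels of one text sample are treated as an unordered "sentence", so every other label of the same sample serves as context for a target label. The function name `skipgram_pairs` and the toy label IDs are hypothetical.

```python
from collections import Counter
from itertools import permutations


def skipgram_pairs(label_sets):
    """Yield (target, context) label pairs for Skip-gram training.

    Assumption (not stated in the abstract): labels attached to the same
    text sample are unordered, so each label is paired with every other
    label of that sample.
    """
    for labels in label_sets:
        unique = sorted(set(labels))  # drop duplicates, fix iteration order
        # permutations(..., 2) emits both (a, b) and (b, a), so each label
        # appears once as target and once as context for every co-label.
        yield from permutations(unique, 2)


# Hypothetical toy data: label IDs attached to three text samples.
samples = [
    ["python", "numpy"],
    ["python", "numpy", "pandas"],
    ["java"],  # a single label yields no pairs
]

pairs = list(skipgram_pairs(samples))
pair_counts = Counter(pairs)  # how often each (target, context) pair occurs
```

Pairs of this form could be passed to any Skip-gram trainer. Labels that frequently co-occur, such as `python` and `numpy` here, would then be pushed toward nearby dense embeddings, which is what makes the embeddings usable for clustering into a label tree.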